Computer Vision and Pattern Recognition 124
☆ DynamicCity: Large-Scale LiDAR Generation from Dynamic Scenes
LiDAR scene generation has been developing rapidly recently. However,
existing methods primarily focus on generating static and single-frame scenes,
overlooking the inherently dynamic nature of real-world driving environments.
In this work, we introduce DynamicCity, a novel 4D LiDAR generation framework
capable of generating large-scale, high-quality LiDAR scenes that capture the
temporal evolution of dynamic environments. DynamicCity mainly consists of two
key models. 1) A VAE model for learning HexPlane as the compact 4D
representation. Instead of using naive averaging operations, DynamicCity
employs a novel Projection Module to effectively compress 4D LiDAR features
into six 2D feature maps for HexPlane construction, which significantly
enhances HexPlane fitting quality (up to 12.56 mIoU gain). Furthermore, we
utilize an Expansion & Squeeze Strategy to reconstruct 3D feature volumes in
parallel, which improves both network training efficiency and reconstruction
accuracy compared to naively querying each 3D point (up to 7.05 mIoU gain, 2.06x
training speedup, and 70.84% memory reduction). 2) A DiT-based diffusion model
for HexPlane generation. To make HexPlane feasible for DiT generation, a Padded
Rollout Operation is proposed to reorganize all six feature planes of the
HexPlane as a squared 2D feature map. In particular, various conditions could
be introduced in the diffusion or sampling process, supporting versatile 4D
generation applications, such as trajectory- and command-driven generation,
inpainting, and layout-conditioned generation. Extensive experiments on the
CarlaSC and Waymo datasets demonstrate that DynamicCity significantly
outperforms existing state-of-the-art 4D LiDAR generation methods across
multiple metrics. The code will be released to facilitate future research.
comment: Preprint; 29 pages, 15 figures, 7 tables; Project Page at
https://dynamic-city.github.io/
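As a rough illustration of the Padded Rollout idea described above, the sketch below pads each of the six HexPlane feature planes to a common tile size and arranges them on a square 2D canvas so that a 2D DiT could process them jointly. The tile layout, dimensions, and zero-padding scheme are illustrative assumptions; the paper's actual operation may organize the planes differently.

```python
import torch
import torch.nn.functional as F

def padded_rollout(planes):
    # planes: six 2D feature maps (C, A, B) for the XY, XZ, YZ, XT, YT, ZT planes
    C = planes[0].shape[0]
    P = max(max(p.shape[1], p.shape[2]) for p in planes)           # common tile size
    tiles = [F.pad(p, (0, P - p.shape[2], 0, P - p.shape[1])) for p in planes]
    canvas = torch.zeros(C, 3 * P, 3 * P)                          # square canvas of 3x3 tiles
    for i, t in enumerate(tiles):                                  # six tiles used, three stay zero
        r, c = divmod(i, 3)
        canvas[:, r * P:(r + 1) * P, c * P:(c + 1) * P] = t
    return canvas

C, X, Y, Z, T = 4, 32, 32, 16, 8                                   # toy sizes
planes = [torch.randn(C, a, b) for a, b in [(X, Y), (X, Z), (Y, Z), (X, T), (Y, T), (Z, T)]]
print(padded_rollout(planes).shape)                                # torch.Size([4, 96, 96])
```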
☆ FIPER: Generalizable Factorized Fields for Joint Image Compression and Super-Resolution
In this work, we propose a unified representation for Super-Resolution (SR)
and Image Compression, termed **Factorized Fields**, motivated by the shared
principles between these two tasks. Both Single-Image Super-Resolution (SISR) and Image Compression require
recovering and preserving fine image details--whether by enhancing resolution
or reconstructing compressed data. Unlike previous methods that mainly focus on
network architecture, our proposed approach utilizes a basis-coefficient
decomposition to explicitly capture multi-scale visual features and structural
components in images, addressing the core challenges of both tasks. We first
derive our SR model, which includes a Coefficient Backbone and Basis Swin
Transformer for generalizable Factorized Fields. Then, to further unify these
two tasks, we leverage the strong information-recovery capabilities of the
trained SR modules as priors in the compression pipeline, improving both
compression efficiency and detail reconstruction. Additionally, we introduce a
merged-basis compression branch that consolidates shared structures, further
optimizing the compression process. Extensive experiments show that our unified
representation delivers state-of-the-art performance, achieving an average
relative improvement of 204.4% in PSNR over the baseline in Super-Resolution
(SR) and 9.35% BD-rate reduction in Image Compression compared to the previous
SOTA.
comment: Project page: https://jayisaking.github.io/FIPER/
☆ FreeVS: Generative View Synthesis on Free Driving Trajectory
Existing reconstruction-based novel view synthesis methods for driving scenes
focus on synthesizing camera views along the recorded trajectory of the ego
vehicle. Their image rendering performance will severely degrade on viewpoints
falling out of the recorded trajectory, where camera rays are untrained. We
propose FreeVS, a novel fully generative approach that can synthesize camera
views on free new trajectories in real driving scenes. To control the
generation results to be 3D consistent with the real scenes and accurate in
viewpoint pose, we propose the pseudo-image representation of view priors to
control the generation process. Viewpoint transformation simulation is applied
on pseudo-images to simulate camera movement in each direction. Once trained,
FreeVS can be applied to any validation sequence without a reconstruction
process and can synthesize views on novel trajectories. Moreover, we propose two new
challenging benchmarks tailored to driving scenes, which are novel camera
synthesis and novel trajectory synthesis, emphasizing the freedom of
viewpoints. Given that no ground truth images are available on novel
trajectories, we also propose to evaluate the consistency of images synthesized
on novel trajectories with 3D perception models. Experiments on the Waymo Open
Dataset show that FreeVS has a strong image synthesis performance on both the
recorded trajectories and novel trajectories. Project Page:
https://freevs24.github.io/
comment: Project Page: https://freevs24.github.io/
☆ UnCLe: Unsupervised Continual Learning of Depth Completion
We propose UnCLe, a standardized benchmark for Unsupervised Continual
Learning of a multimodal depth estimation task: Depth completion aims to infer
a dense depth map from a pair of synchronized RGB image and sparse depth map.
We benchmark depth completion models under the practical scenario of
unsupervised learning over continuous streams of data. Existing methods are
typically trained on a static, or stationary, dataset. However, when adapting
to novel non-stationary distributions, they "catastrophically forget"
previously learned information. UnCLe simulates these non-stationary
distributions by adapting depth completion models to sequences of datasets
containing diverse scenes captured from distinct domains using different visual
and range sensors. We adopt representative methods from continual learning
paradigms and translate them to enable unsupervised continual learning of depth
completion. We benchmark these models in indoor and outdoor settings and investigate
the degree of catastrophic forgetting through standard quantitative metrics.
Furthermore, we introduce model inversion quality as an additional measure of
forgetting. We find that unsupervised continual learning of depth completion is
an open problem, and we invite researchers to leverage UnCLe as a development
platform.
comment: Preprint
☆ WorldSimBench: Towards Video Generation Models as World Simulators
Yiran Qin, Zhelun Shi, Jiwen Yu, Xijun Wang, Enshen Zhou, Lijun Li, Zhenfei Yin, Xihui Liu, Lu Sheng, Jing Shao, Lei Bai, Wanli Ouyang, Ruimao Zhang
Recent advancements in predictive models have demonstrated exceptional
capabilities in predicting the future state of objects and scenes. However, the
lack of categorization based on inherent characteristics continues to hinder
the progress of predictive model development. Additionally, existing benchmarks
are unable to effectively evaluate higher-capability, highly embodied
predictive models from an embodied perspective. In this work, we classify the
functionalities of predictive models into a hierarchy and take the first step
in evaluating World Simulators by proposing a dual evaluation framework called
WorldSimBench. WorldSimBench includes Explicit Perceptual Evaluation and
Implicit Manipulative Evaluation, encompassing human preference assessments
from the visual perspective and action-level evaluations in embodied tasks,
covering three representative embodied scenarios: Open-Ended Embodied
Environment, Autonomous Driving, and Robot Manipulation. In the Explicit
Perceptual Evaluation, we introduce the HF-Embodied Dataset, a video assessment
dataset based on fine-grained human feedback, which we use to train a Human
Preference Evaluator that aligns with human perception and explicitly assesses
the visual fidelity of World Simulators. In the Implicit Manipulative
Evaluation, we assess the video-action consistency of World Simulators by
evaluating whether the generated situation-aware video can be accurately
translated into the correct control signals in dynamic environments. Our
comprehensive evaluation offers key insights that can drive further innovation
in video generation models, positioning World Simulators as a pivotal
advancement toward embodied artificial intelligence.
★ TP-Eval: Tap Multimodal LLMs' Potential in Evaluation by Customizing Prompts
Recently, multimodal large language models (MLLMs) have received much
attention for their impressive capabilities. The evaluation of MLLMs is
becoming critical to analyzing attributes of MLLMs and providing valuable
insights. However, current benchmarks overlook the problem of prompt
sensitivity - minor prompt variations may lead to significant performance
fluctuations. Thus, inappropriate prompts may obscure the models' capabilities
and lead to an underestimation of their performance. Moreover, different models have
different preferences for different prompts, and thus, using the same prompt
for all models will cause evaluation bias. This paper analyzes this deficiency
in existing benchmarks and further introduces a new evaluation framework named
TP-Eval, which introduces a prompt customization method to reduce evaluation
biases and tap models' potential. TP-Eval will rewrite the original prompts to
different customized prompts for different models. In particular, we propose
some well-designed modules for prompt customization tailored to the scenario of
MLLM evaluation. Extensive experiments demonstrate the effectiveness of our
approach to uncovering models' capabilities, and TP-Eval should benefit the
community in developing more comprehensive and convincing MLLM evaluation
benchmarks.
☆ SPIRE: Synergistic Planning, Imitation, and Reinforcement Learning for Long-Horizon Manipulation
Robot learning has proven to be a general and effective technique for
programming manipulators. Imitation learning is able to teach robots solely
from human demonstrations but is bottlenecked by the capabilities of the
demonstrations. Reinforcement learning uses exploration to discover better
behaviors; however, the space of possible improvements can be too large to
start from scratch. For both techniques, the learning difficulty increases in
proportion to the length of the manipulation task. Accounting for this, we
propose SPIRE, a system that first uses Task and Motion Planning (TAMP) to
decompose tasks into smaller learning subproblems and second combines imitation
and reinforcement learning to maximize their strengths. We develop novel
strategies to train learning agents when deployed in the context of a planning
system. We evaluate SPIRE on a suite of long-horizon and contact-rich robot
manipulation problems. We find that SPIRE outperforms prior approaches that
integrate imitation learning, reinforcement learning, and planning by 35% to
50% in average task performance, is 6 times more data efficient in the number
of human demonstrations needed to train proficient agents, and learns to
complete tasks nearly twice as efficiently. See
https://sites.google.com/view/spire-corl-2024 for more details.
comment: Conference on Robot Learning (CoRL) 2024
☆ CLEAR: Character Unlearning in Textual and Visual Modalities
Alexey Dontsov, Dmitrii Korzh, Alexey Zhavoronkin, Boris Mikheev, Denis Bobkov, Aibek Alanov, Oleg Y. Rogov, Ivan Oseledets, Elena Tutubalina
Machine Unlearning (MU) is critical for enhancing privacy and security in
deep learning models, particularly in large multimodal language models (MLLMs),
by removing specific private or hazardous information. While MU has made
significant progress in textual and visual modalities, multimodal unlearning
(MMU) remains significantly underexplored, partially due to the absence of a
suitable open-source benchmark. To address this, we introduce CLEAR, a new
benchmark designed to evaluate MMU methods. CLEAR contains 200 fictitious
individuals and 3,700 images linked with corresponding question-answer pairs,
enabling a thorough evaluation across modalities. We assess 10 MU methods,
adapting them for MMU, and highlight new challenges specific to multimodal
forgetting. We also demonstrate that simple $\ell_1$ regularization on LoRA
weights significantly mitigates catastrophic forgetting, preserving model
performance on retained data. The dataset is available at
https://huggingface.co/datasets/therem/CLEAR
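A minimal sketch of the $\ell_1$-on-LoRA idea mentioned above, assuming a PEFT-style model whose adapter parameters carry "lora_" in their names; the loss names and the penalty weight are placeholders rather than the paper's recipe.

```python
import torch

def lora_l1_penalty(model, weight=1e-4):
    # Sum of absolute values of all trainable LoRA adapter weights.
    penalty = sum(p.abs().sum() for n, p in model.named_parameters()
                  if "lora_" in n and p.requires_grad)
    return weight * penalty

# Hypothetical usage inside an unlearning step:
#   loss = unlearning_loss + lora_l1_penalty(model)
#   loss.backward()
```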
☆ In-Pixel Foreground and Contrast Enhancement Circuits with Customizable Mapping
This paper presents an innovative in-pixel contrast enhancement circuit that
performs image processing directly within the pixel circuit. The circuit can be
tuned for different modes of operation. In foreground enhancement mode, it
suppresses low-intensity background pixels to nearly zero, isolating the
foreground for better object visibility. In contrast enhancement mode, it
improves overall image contrast. The contrast enhancement function is
customizable both during the design phase and in real-time, allowing the
circuit to adapt to specific applications and varying lighting conditions. A
model of the designed pixel circuit is developed and applied to a full pixel
array, demonstrating significant improvements in image quality. Simulations
performed in HSPICE show a nearly 6x increase in Michelson Contrast Ratio (CR)
in the foreground enhancement mode. The simulation results indicate its
potential for real-time, adaptive contrast enhancement across various imaging
environments.
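For reference, the Michelson contrast ratio reported above follows the standard definition $CR = (I_{\max} - I_{\min})/(I_{\max} + I_{\min})$, computed from the maximum and minimum intensities of the region of interest; the specific region and intensity scale used in the paper's HSPICE evaluation are not restated here.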
☆ Real time anomalies detection on video
Nowadays, many places use security cameras. Unfortunately, when an incident
occurs, the footage is typically reviewed only after the fact, so these systems
serve more as a deterrent than as a detection tool. In this article, we propose
a deep learning approach to address this problem. The approach uses
convolutional neural networks (CNNs) to extract relevant features from video
frames; these features form time series that are then analyzed by LSTM/GRU
models.
☆ Scalable Ranked Preference Optimization for Text-to-Image Generation
Direct Preference Optimization (DPO) has emerged as a powerful approach to
align text-to-image (T2I) models with human feedback. Unfortunately, successful
application of DPO to T2I models requires a huge amount of resources to collect
and label large-scale datasets, e.g., millions of generated paired images
annotated with human preferences. In addition, these human preference datasets
can get outdated quickly as the rapid improvements of T2I models lead to higher
quality images. In this work, we investigate a scalable approach for collecting
large-scale and fully synthetic datasets for DPO training. Specifically, the
preferences for paired images are generated using a pre-trained reward
function, eliminating the need for involving humans in the annotation process,
greatly improving the dataset collection efficiency. Moreover, we demonstrate
that such datasets allow averaging predictions across multiple models and
collecting ranked preferences as opposed to pairwise preferences. Furthermore,
we introduce RankDPO to enhance DPO-based methods using the ranking feedback.
Applying RankDPO on SDXL and SD3-Medium models with our synthetically generated
preference dataset ``Syn-Pic'' improves both prompt-following (on benchmarks
like T2I-Compbench, GenEval, and DPG-Bench) and visual quality (through user
studies). This pipeline presents a practical and scalable solution to develop
better preference datasets to enhance the performance of text-to-image models.
comment: Project Page: https://snap-research.github.io/RankDPO/
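The data-collection idea above, replacing human annotators with pretrained reward models and collecting ranked rather than pairwise preferences, can be sketched as follows; the reward-model interface and the score-averaging scheme are illustrative assumptions, not the paper's pipeline.

```python
import torch

def rank_candidates(reward_models, images, prompt):
    # Score each candidate image with every reward model, average the scores,
    # and return candidate indices ordered from most to least preferred.
    scores = torch.stack([rm(images, prompt) for rm in reward_models])  # (num_models, num_images)
    return torch.argsort(scores.mean(dim=0), descending=True)

# Toy usage with dummy stand-ins for reward models:
imgs = torch.randn(5, 3, 64, 64)
dummy_rms = [lambda x, p: x.mean(dim=(1, 2, 3)), lambda x, p: x.std(dim=(1, 2, 3))]
print(rank_candidates(dummy_rms, imgs, "a photo of a cat"))
```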
☆ Characterization of the multiplicity of solutions for camera pose given two vertically-aligned landmarks and accelerometer
We consider the problem of recovering the position and orientation of a
camera equipped with an accelerometer from sensor images of two labeled
landmarks whose positions in a coordinate system aligned in a known way with
gravity are known. This a variant on the much studied P$n$P problem of
recovering camera position and orientation from $n$ points without any
gravitational data. It is proved that in three types of singular cases there
are infinitely many solutions, in another type of case there is one, and in a
final type of case there are two. A precise characterization of each type of
case is given. In particular, there is always a unique solution in the practically
interesting case where the two landmarks are at the same altitude and the
camera is at a different altitude. This case is studied by numerical simulation
and an implementation on a consumer cellphone. It is also proved that if the
two landmarks are unlabeled, then apart from the same singular cases, there are
still always one or two solutions.
comment: 32 pages, 8 figures
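As a quick sanity check on these solution counts (a standard degree-of-freedom argument, not taken from the paper): the accelerometer fixes the camera's roll and pitch, leaving a single rotational unknown (the heading about the gravity axis) plus three translational unknowns, i.e. four unknowns in total, while the two labeled landmark projections contribute $2 \times 2 = 4$ scalar constraints; a square system of this kind generically admits finitely many solutions, consistent with the one- or two-solution cases characterized above.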
☆ A Pipeline for Segmenting and Structuring RGB-D Data for Robotics Applications
We introduce a novel pipeline for segmenting and structuring color and depth
(RGB-D) data. Existing processing pipelines for RGB-D data have focused on
extracting geometric information alone. This approach precludes the development
of more advanced robotic navigation and manipulation algorithms, which benefit
from a semantic understanding of their environment. Our pipeline can segment
RGB-D data into accurate semantic masks. These masks are then used to fuse raw
captured point clouds into semantically separated point clouds. We store this
information using the Universal Scene Description (USD) file format, a format
suitable for easy querying by downstream robotics algorithms, human-friendly
visualization, and robotics simulation.
☆ Robust Two-View Geometry Estimation with Implicit Differentiation IROS 2024
We present a novel two-view geometry estimation framework which is based on a
differentiable robust loss function fitting. We propose to treat the robust
fundamental matrix estimation as an implicit layer, which allows us to avoid
backpropagation through time and significantly improves the numerical
stability. To take full advantage of the information from the feature matching
stage we incorporate learnable weights that depend on the matching confidences.
In this way our solution brings together feature extraction, matching and
two-view geometry estimation in a unified end-to-end trainable pipeline. We
evaluate our approach on the camera pose estimation task in both outdoor and
indoor scenarios. The experiments on several datasets show that the proposed
method outperforms both classic and learning-based state-of-the-art methods by
a large margin. The project webpage is available at:
https://github.com/VladPyatov/ihls
comment: IROS 2024 Accepted
☆ A Wavelet Diffusion GAN for Image Super-Resolution
In recent years, diffusion models have emerged as a superior alternative to
generative adversarial networks (GANs) for high-fidelity image generation, with
wide applications in text-to-image generation, image-to-image translation, and
super-resolution. However, their real-time feasibility is hindered by slow
training and inference speeds. This study addresses this challenge by proposing
a wavelet-based conditional Diffusion GAN scheme for Single-Image
Super-Resolution (SISR). Our approach utilizes the diffusion GAN paradigm to
reduce the timesteps required by the reverse diffusion process and the Discrete
Wavelet Transform (DWT) to achieve dimensionality reduction, decreasing
training and inference times significantly. The results of an experimental
validation on the CelebA-HQ dataset confirm the effectiveness of our proposed
scheme. Our approach outperforms other state-of-the-art methodologies,
ensuring high-fidelity output while overcoming the inherent drawbacks of
diffusion models in time-sensitive applications.
comment: The paper has been accepted at Italian Workshop on Neural Networks
(WIRN) 2024
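The wavelet-based dimensionality reduction mentioned above can be illustrated with a single-level 2D DWT that halves the spatial resolution and packs the four subbands as channels; the wavelet choice ("haar") and the exact pre-/post-processing around the diffusion GAN are assumptions for this sketch.

```python
import numpy as np
import pywt

def dwt_pack(img):
    # One-level 2D DWT: (H, W) -> four subbands stacked as (4, H/2, W/2).
    LL, (LH, HL, HH) = pywt.dwt2(img, "haar")
    return np.stack([LL, LH, HL, HH], axis=0)

def dwt_unpack(subbands):
    LL, LH, HL, HH = subbands
    return pywt.idwt2((LL, (LH, HL, HH)), "haar")

img = np.random.rand(64, 64)
packed = dwt_pack(img)
print(packed.shape, np.allclose(dwt_unpack(packed), img))  # (4, 32, 32) True
```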
☆ Medical Imaging Complexity and its Effects on GAN Performance ACCV
The proliferation of machine learning models in diverse clinical applications
has led to a growing need for high-fidelity, medical image training data. Such
data is often scarce due to cost constraints and privacy concerns. Alleviating
this burden, medical image synthesis via generative adversarial networks (GANs)
emerged as a powerful method for synthetically generating photo-realistic
images based on existing sets of real medical images. However, the exact image
set size required to efficiently train such a GAN is unclear. In this work, we
experimentally establish benchmarks that measure the relationship between a
sample dataset size and the fidelity of the generated images, given the
dataset's distribution of image complexities. We analyze statistical metrics
based on delentropy, an image complexity measure rooted in Shannon's entropy in
information theory. For our pipeline, we conduct experiments with two
state-of-the-art GANs, StyleGAN 3 and SPADE-GAN, trained on multiple medical
imaging datasets with variable sample sizes. Across both GANs, general
performance improved with increasing training set size but suffered with
increasing complexity.
comment: Accepted to ACCV, Workshop on Generative AI for Synthetic Medical
Data
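The delentropy statistic referenced above is commonly computed as the Shannon entropy of the 2D joint histogram of image gradient components; the sketch below follows that common definition, and the gradient operator, bin count, and normalisation are assumptions that may differ from the paper's exact setup.

```python
import numpy as np

def delentropy(img, bins=64):
    gx, gy = np.gradient(img.astype(np.float64))                    # gradient components ("del")
    hist, _, _ = np.histogram2d(gx.ravel(), gy.ravel(), bins=bins)  # joint gradient histogram
    p = hist / hist.sum()
    p = p[p > 0]
    return -np.sum(p * np.log2(p)) / 2.0                            # halved, per the usual convention

print(delentropy(np.random.rand(128, 128)))
```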
☆ VR-Splatting: Foveated Radiance Field Rendering via 3D Gaussian Splatting and Neural Points
Recent advances in novel view synthesis (NVS), particularly neural radiance
fields (NeRF) and Gaussian splatting (3DGS), have demonstrated impressive
results in photorealistic scene rendering. These techniques hold great
potential for applications in virtual tourism and teleportation, where
immersive realism is crucial. However, the high-performance demands of virtual
reality (VR) systems present challenges in directly utilizing even such
fast-to-render scene representations like 3DGS due to latency and computational
constraints.
In this paper, we propose foveated rendering as a promising solution to these
obstacles. We analyze state-of-the-art NVS methods with respect to their
rendering performance and compatibility with the human visual system. Our
approach introduces a novel foveated rendering method for Virtual Reality that
leverages the sharp, detailed output of neural point rendering for the foveal
region, fused with a smooth rendering of 3DGS for peripheral vision.
Our evaluation confirms that perceived sharpness and detail-richness are
increased by our approach compared to a standard VR-ready 3DGS configuration.
Our system meets the necessary performance requirements for real-time VR
interactions, ultimately enhancing the user's immersive experience.
Project page: https://lfranke.github.io/vr_splatting
☆ Gaze-Assisted Medical Image Segmentation NeurIPS'24
The annotation of patient organs is a crucial part of various diagnostic and
treatment procedures, such as radiotherapy planning. Manual annotation is
extremely time-consuming, while its automation using modern image analysis
techniques has not yet reached levels sufficient for clinical adoption. This
paper investigates the idea of semi-supervised medical image segmentation using
human gaze as interactive input for segmentation correction. In particular, we
fine-tuned the Segment Anything Model in Medical Images (MedSAM), a public
solution that uses various prompt types as additional input for semi-automated
segmentation correction. We used human gaze data from reading abdominal images
as a prompt for fine-tuning MedSAM. The model was validated on a public WORD
database, which consists of 120 CT scans of 16 abdominal organs. The results of
the gaze-assisted MedSAM were shown to be superior to the results of the
state-of-the-art segmentation models. In particular, the average Dice
coefficient for 16 abdominal organs was 85.8%, 86.7%, 81.7%, and 90.5% for
nnUNetV2, ResUNet, original MedSAM, and our gaze-assisted MedSAM model,
respectively.
comment: 16 pages, 4 figures, Accepted to AIM-FM Workshop @ NeurIPS'24
☆ Addressing Asynchronicity in Clinical Multimodal Fusion via Individualized Chest X-ray Generation NeurIPS-24
Integrating multi-modal clinical data, such as electronic health records
(EHR) and chest X-ray images (CXR), is particularly beneficial for clinical
prediction tasks. However, in a temporal setting, multi-modal data are often
inherently asynchronous. EHR can be continuously collected but CXR is generally
taken with a much longer interval due to its high cost and radiation dose. When
clinical prediction is needed, the last available CXR image might have been
outdated, leading to suboptimal predictions. To address this challenge, we
propose DDL-CXR, a method that dynamically generates an up-to-date latent
representation of the individualized CXR images. Our approach leverages latent
diffusion models for patient-specific generation strategically conditioned on a
previous CXR image and EHR time series, providing information regarding
anatomical structures and disease progressions, respectively. In this way, the
interaction across modalities could be better captured by the latent CXR
generation process, ultimately improving the prediction performance.
Experiments using MIMIC datasets show that the proposed model could effectively
address asynchronicity in multimodal fusion and consistently outperform
existing methods.
comment: Accepted by NeurIPS-24
☆ R-CoT: Reverse Chain-of-Thought Problem Generation for Geometric Reasoning in Large Multimodal Models
Linger Deng, Yuliang Liu, Bohan Li, Dongliang Luo, Liang Wu, Chengquan Zhang, Pengyuan Lyu, Ziyang Zhang, Gang Zhang, Errui Ding, Yingying Zhu, Xiang Bai
Existing Large Multimodal Models (LMMs) struggle with mathematical geometric
reasoning due to a lack of high-quality image-text paired data. Current
geometric data generation approaches, which apply preset templates to generate
geometric data or use Large Language Models (LLMs) to rephrase questions and
answers (Q&A), unavoidably limit data accuracy and diversity. To synthesize
higher-quality data, we propose a two-stage Reverse Chain-of-Thought (R-CoT)
geometry problem generation pipeline. First, we introduce GeoChain to produce
high-fidelity geometric images and corresponding descriptions highlighting
relations among geometric elements. We then design a Reverse A&Q method that
reasons step-by-step based on the descriptions and generates questions in
reverse from the reasoning results. Experiments demonstrate that the proposed
method brings significant and consistent improvements on multiple LMM
baselines, achieving new performance records in the 2B, 7B, and 8B settings.
Notably, R-CoT-8B significantly outperforms previous state-of-the-art
open-source mathematical models by 16.6% on MathVista and 9.2% on GeoQA, while
also surpassing the closed-source model GPT-4o by an average of 13% across both
datasets. The code is available at https://github.com/dle666/R-CoT.
☆ A utility-based spatial analysis of residential street-level conditions; A case study of Rotterdam
Residential location choices are traditionally modelled using factors related
to accessibility and socioeconomic environments, neglecting the importance of
local street-level conditions. Arguably, this neglect is due to data practices.
Today, however, street-level images -- which are highly effective at encoding
street-level conditions -- are widely available. Additionally, recent advances
in discrete choice models incorporating computer vision capabilities offer
opportunities to integrate street-level conditions into residential location
choice analysis. This study leverages these developments to investigate the
spatial distribution of utility derived from street-level conditions in
residential location choices on a city-wide scale. In our case study of
Rotterdam, the Netherlands, we find that the utility derived from street-level
conditions varies significantly on a highly localised scale, with conditions
rapidly changing even within neighbourhoods. Our results also reveal that the
high real-estate prices in the city centre cannot be attributed to attractive
street-level conditions. Furthermore, whereas the city centre is characterised
by relatively unattractive residential street-level conditions, neighbourhoods
in the southern part of the city -- often perceived as problematic -- exhibit
surprisingly appealing street-level environments. The methodological
contribution of this paper is that it advances the discrete choice models
incorporating computer vision capabilities by introducing a semantic
regularisation layer to the model. Thereby, it adds explainability and
eliminates the need for a separate pipeline to extract information from images,
streamlining the analysis. As such, this paper's findings and methodological
advancements pave the way for further studies to explore integrating
street-level conditions in urban planning.
☆ CASCRNet: An Atrous Spatial Pyramid Pooling and Shared Channel Residual based Network for Capsule Endoscopy
This manuscript summarizes work on the Capsule Vision Challenge 2024 by
MISAHUB. To address the multi-class disease classification task, which is
challenging due to the complexity and imbalance in the Capsule Vision challenge
dataset, this paper proposes CASCRNet (Capsule endoscopy-Aspp-SCR-Network), a
parameter-efficient and novel model that uses Shared Channel Residual (SCR)
blocks and Atrous Spatial Pyramid Pooling (ASPP) blocks. Further, the
performance of the proposed model is compared with other well-known approaches.
The experimental results show that the proposed model provides better disease
classification results. The proposed model was successful in classifying
diseases with an F1 Score of 78.5% and a Mean AUC of 98.3%, which is promising
given its compact architecture.
comment: 8 pages, 4 figures
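For readers unfamiliar with ASPP, the block referenced above is, in its generic DeepLab-style form, a set of parallel dilated convolutions whose outputs are concatenated and projected; the dilation rates and channel sizes below are illustrative, and CASCRNet's SCR blocks are not reproduced.

```python
import torch
import torch.nn as nn

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        # Parallel 3x3 convolutions with increasing dilation (padding keeps spatial size).
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates])
        self.project = nn.Conv2d(out_ch * len(rates), out_ch, 1)

    def forward(self, x):
        return self.project(torch.cat([b(x) for b in self.branches], dim=1))

print(ASPP(32, 64)(torch.randn(1, 32, 56, 56)).shape)  # torch.Size([1, 64, 56, 56])
```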
☆ Blendify -- Python rendering framework for Blender
With the rapid growth of research in fields like computer vision and computer
graphics, researchers require effective and user-friendly
rendering tools to visualize results. While advanced tools like Blender offer
powerful capabilities, they also require a significant effort to master. This
technical report introduces Blendify, a lightweight Python-based framework that
seamlessly integrates with Blender, providing a high-level API for scene
creation and rendering. Blendify reduces the complexity of working with
Blender's native API by automating object creation, handling color and
material linking, and implementing features such as shadow-catcher objects,
while maintaining support for high-quality ray-tracing rendering output. With a
focus on usability, Blendify enables an efficient and flexible rendering
workflow for common computer vision and computer graphics use cases. The
code is available at https://github.com/ptrvilya/blendify
comment: Project page: https://virtualhumans.mpi-inf.mpg.de/blendify/
☆ ROCKET-1: Master Open-World Interaction with Visual-Temporal Context Prompting
Vision-language models (VLMs) have excelled in multimodal tasks, but adapting
them to embodied decision-making in open-world environments presents
challenges. A key issue is the difficulty in smoothly connecting individual
entities in low-level observations with abstract concepts required for
planning. A common approach to address this problem is through the use of
hierarchical agents, where VLMs serve as high-level reasoners that break down
tasks into executable sub-tasks, typically specified using language and
imagined observations. However, language often fails to effectively convey
spatial information, while generating future images with sufficient accuracy
remains challenging. To address these limitations, we propose visual-temporal
context prompting, a novel communication protocol between VLMs and policy
models. This protocol leverages object segmentation from both past and present
observations to guide policy-environment interactions. Using this approach, we
train ROCKET-1, a low-level policy that predicts actions based on concatenated
visual observations and segmentation masks, with real-time object tracking
provided by SAM-2. Our method unlocks the full potential of VLMs'
visual-language reasoning abilities, enabling them to solve complex creative
tasks, especially those heavily reliant on spatial understanding. Experiments
in Minecraft demonstrate that our approach allows agents to accomplish
previously unattainable tasks, highlighting the effectiveness of
visual-temporal context prompting in embodied decision-making. Codes and demos
will be available on the project page: https://craftjarvis.github.io/ROCKET-1.
☆ TAGE: Trustworthy Attribute Group Editing for Stable Few-shot Image Generation
Generative Adversarial Networks (GANs) have emerged as a prominent research
focus for image editing tasks, leveraging the powerful image generation
capabilities of the GAN framework to produce remarkable results. However,
prevailing approaches are contingent upon extensive training datasets and
explicit supervision, presenting a significant challenge in manipulating the
diverse attributes of new image classes with limited sample availability. To
surmount this hurdle, we introduce TAGE, an innovative image generation network
comprising three integral modules: the Codebook Learning Module (CLM), the Code
Prediction Module (CPM), and the Prompt-driven Semantic Module (PSM). The CLM
module delves into the semantic dimensions of category-agnostic attributes,
encapsulating them within a discrete codebook. This module is predicated on the
concept that images are assemblages of attributes, and thus, by editing these
category-independent attributes, it is theoretically possible to generate
images from unseen categories. Subsequently, the CPM module facilitates
naturalistic image editing by predicting indices of category-independent
attribute vectors within the codebook. Additionally, the PSM module generates
semantic cues that are seamlessly integrated into the Transformer architecture
of the CPM, enhancing the model's comprehension of the targeted attributes for
editing. With these semantic cues, the model can generate images that
accentuate desired attributes more prominently while maintaining the integrity
of the original category, even with a limited number of samples. We have
conducted extensive experiments utilizing the Animal Faces, Flowers, and
VGGFaces datasets. The results of these experiments demonstrate that our
proposed method not only achieves superior performance but also exhibits a high
degree of stability when compared to other few-shot image generation
techniques.
comment: Accepted by the International Conference on Signal Processing Systems
☆ Few-shot NeRF by Adaptive Rendering Loss Regularization ECCV2024
Novel view synthesis with sparse inputs poses great challenges to Neural
Radiance Field (NeRF). Recent works demonstrate that the frequency
regularization of Positional Encoding (PE) can achieve promising results for
few-shot NeRF. In this work, we reveal that there exists an inconsistency
between the frequency regularization of PE and rendering loss. This prevents
few-shot NeRF from synthesizing higher-quality novel views. To mitigate this
inconsistency, we propose Adaptive Rendering loss regularization for few-shot
NeRF, dubbed AR-NeRF. Specifically, we present a two-phase rendering
supervision and an adaptive rendering loss weight learning strategy to align
the frequency relationship between PE and 2D-pixel supervision. In this way,
AR-NeRF can learn global structures better in the early training phase and
adaptively learn local details throughout the training process. Extensive
experiments show that our AR-NeRF achieves state-of-the-art performance on
different datasets, including object-level and complex scenes.
comment: Accepted by ECCV2024
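For context, the "frequency regularization of Positional Encoding" that the abstract builds on can be sketched as a NeRF-style encoding whose high-frequency bands are masked early in training and released as training progresses; this illustrates the prior technique only, and AR-NeRF's adaptive rendering-loss weighting is not reproduced here.

```python
import math
import torch

def freq_regularized_pe(x, num_freqs=10, progress=1.0):
    # x: (..., D) coordinates; progress in [0, 1] controls how many frequency
    # bands are visible (all bands once progress reaches 1).
    freqs = 2.0 ** torch.arange(num_freqs) * math.pi
    enc = torch.cat([torch.sin(x[..., None] * freqs),
                     torch.cos(x[..., None] * freqs)], dim=-1)      # (..., D, 2 * num_freqs)
    mask = (torch.arange(num_freqs) < progress * num_freqs).float()
    return enc * mask.repeat(2)

pts = torch.rand(4, 3)
print(freq_regularized_pe(pts, progress=0.3).shape)  # torch.Size([4, 3, 20])
```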
☆ Exploiting Text-Image Latent Spaces for the Description of Visual Concepts ICPR
Concept Activation Vectors (CAVs) offer insights into neural network
decision-making by linking human-friendly concepts to the model's internal
feature extraction process. However, when a new set of CAVs is discovered, they
must still be translated into a human understandable description. For
image-based neural networks, this is typically done by visualizing the most
relevant images of a CAV, while the determination of the concept is left to
humans. In this work, we introduce an approach to aid the interpretation of
newly discovered concept sets by suggesting textual descriptions for each CAV.
This is done by mapping the most relevant images representing a CAV into a
text-image embedding where a joint description of these relevant images can be
computed. We propose utilizing the most relevant receptive fields instead of
encoding full images. We demonstrate the capabilities of this approach in
multiple experiments with and without given CAV labels, showing that the
proposed approach provides accurate descriptions for the CAVs and reduces the
challenge of concept interpretation.
comment: 19 pages, 7 figures, to be published in ICPR
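The mapping step described above, embedding a CAV's most relevant receptive fields into a joint text-image space and scoring candidate descriptions, can be sketched as follows; the encoder interfaces and the cosine-similarity ranking are placeholder assumptions rather than the paper's implementation.

```python
import torch
import torch.nn.functional as F

def describe_cav(crop_embeddings, candidate_texts, text_encoder):
    # Average the embeddings of the CAV's most relevant crops and rank candidate
    # textual descriptions by cosine similarity in the shared embedding space.
    cav_vec = F.normalize(crop_embeddings.mean(dim=0), dim=-1)
    text_vecs = F.normalize(text_encoder(candidate_texts), dim=-1)
    order = (text_vecs @ cav_vec).argsort(descending=True)
    return [candidate_texts[int(i)] for i in order]

# Toy usage with a dummy text encoder standing in for a CLIP-like model:
crops = torch.randn(20, 512)
texts = ["striped texture", "dog snout", "sky background"]
print(describe_cav(crops, texts, lambda t: torch.randn(len(t), 512)))
```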
☆ Att2CPC: Attention-Guided Lossy Attribute Compression of Point Clouds
With the great progress of 3D sensing and acquisition technology, the volume
of point cloud data has grown dramatically, which urges the development of
efficient point cloud compression methods. In this paper, we focus on the task
of learned lossy point cloud attribute compression (PCAC). We propose an
efficient attention-based method for lossy compression of point cloud
attributes leveraging on an autoencoder architecture. Specifically, at the
encoding side, we conduct multiple downsampling to best exploit the local
attribute patterns, in which effective External Cross Attention (ECA) is
devised to hierarchically aggregate features by integrating attributes and
geometry contexts. At the decoding side, the attributes of the point cloud are
progressively reconstructed based on the multi-scale representation and the
zero-padding upsampling tactic. To the best of our knowledge, this is the first
approach to introduce an attention mechanism to the point-based lossy PCAC task. We
verify the compression efficiency of our model on various sequences, including
human body frames, sparse objects, and large-scale point cloud scenes.
Experiments show that our method achieves an average improvement of 1.15 dB and
2.13 dB in BD-PSNR for the Y channel and YUV channels, respectively, compared
with the state-of-the-art point-based method Deep-PCAC. Codes of this paper are
available at https://github.com/I2-Multimedia-Lab/Att2CPC.
☆ DREB-Net: Dual-stream Restoration Embedding Blur-feature Fusion Network for High-mobility UAV Object Detection
Object detection algorithms are pivotal components of unmanned aerial vehicle
(UAV) imaging systems, extensively employed in complex fields. However, images
captured by high-mobility UAVs often suffer from motion blur cases, which
significantly impedes the performance of advanced object detection algorithms.
To address these challenges, we propose an innovative object detection
algorithm specifically designed for blurry images, named DREB-Net (Dual-stream
Restoration Embedding Blur-feature Fusion Network). First, DREB-Net addresses
the particularities of the blurry image object detection problem by incorporating a
Blurry image Restoration Auxiliary Branch (BRAB) during the training phase.
Second, it fuses the extracted shallow features via Multi-level
Attention-Guided Feature Fusion (MAGFF) module, to extract richer features.
Here, the MAGFF module comprises local attention modules and global attention
modules, which assign different weights to the branches. Then, during the
inference phase, the deep feature extraction of the BRAB can be removed to
reduce computational complexity and improve detection speed. In the loss function,
a combined loss of MSE and SSIM is added to the BRAB to restore blurry images.
Finally, DREB-Net introduces Fast Fourier Transform in the early stages of
feature extraction, via a Learnable Frequency domain Amplitude Modulation
Module (LFAMM), to adjust feature amplitude and enhance feature processing
capability. Experimental results indicate that DREB-Net can still effectively
perform object detection tasks under motion blur in captured images, showcasing
excellent performance and broad application prospects. Our source code will be
available at https://github.com/EEIC-Lab/DREB-Net.git.
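The frequency-domain step above can be pictured as a learnable rescaling of FFT amplitudes with the phase kept intact; the module below is a rough sketch in that spirit, using a simple per-frequency learnable scale, and is not the actual LFAMM design.

```python
import torch
import torch.nn as nn

class FreqAmplitudeModulation(nn.Module):
    def __init__(self, channels, h, w):
        super().__init__()
        self.scale = nn.Parameter(torch.ones(channels, h, w // 2 + 1))  # rfft2 half-spectrum width

    def forward(self, x):                                  # x: (B, C, H, W)
        spec = torch.fft.rfft2(x, norm="ortho")
        amp, phase = spec.abs(), spec.angle()
        spec = (amp * self.scale) * torch.exp(1j * phase)  # modulate amplitude, keep phase
        return torch.fft.irfft2(spec, s=x.shape[-2:], norm="ortho")

x = torch.randn(2, 8, 32, 32)
print(FreqAmplitudeModulation(8, 32, 32)(x).shape)         # torch.Size([2, 8, 32, 32])
```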
☆ Deep Learning for Active Region Classification: A Systematic Study from Convolutional Neural Networks to Vision Transformers
A solar active region can significantly disrupt the Sun-Earth space
environment, often leading to severe space weather events such as solar flares
and coronal mass ejections. As a consequence, the automatic classification of
active region groups is the crucial starting point for accurately and promptly
predicting solar activity. This study presents our results concerned with the
application of deep learning techniques to the classification of active region
cutouts based on the Mount Wilson classification scheme. Specifically, we have
explored the latest advancements in image classification architectures, from
Convolutional Neural Networks to Vision Transformers, and reported on their
performances for the active region classification task, showing that the
crucial point for their effectiveness lies in a robust training process
based on the latest advances in the field.
☆ Learning Lossless Compression for High Bit-Depth Volumetric Medical Image
Recent advances in learning-based methods have markedly enhanced the
capabilities of image compression. However, these methods struggle with high
bit-depth volumetric medical images, facing issues such as degraded
performance, increased memory demand, and reduced processing speed. To address
these challenges, this paper presents the Bit-Division based Lossless
Volumetric Image Compression (BD-LVIC) framework, which is tailored for high
bit-depth medical volume compression. The BD-LVIC framework skillfully divides
the high bit-depth volume into two lower bit-depth segments: the Most
Significant Bit-Volume (MSBV) and the Least Significant Bit-Volume (LSBV). The
MSBV concentrates on the most significant bits of the volumetric medical image,
capturing vital structural details in a compact manner. This reduction in
complexity greatly improves compression efficiency using traditional codecs.
Conversely, the LSBV deals with the least significant bits, which encapsulate
intricate texture details. To compress this detailed information effectively,
we introduce an effective learning-based compression model equipped with a
Transformer-Based Feature Alignment Module, which exploits both intra-slice and
inter-slice redundancies to accurately align features. Subsequently, a Parallel
Autoregressive Coding Module merges these features to precisely estimate the
probability distribution of the least significant bit-planes. Our extensive
testing demonstrates that the BD-LVIC framework not only sets new performance
benchmarks across various datasets but also maintains a competitive coding
speed, highlighting its significant potential and practical utility in the
realm of volumetric medical image compression.
comment: 13 pages
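The bit-division step described above can be sketched as a simple split of each high bit-depth voxel into its most and least significant bits; the 8/8 split point below is an illustrative choice, and the two codecs applied to the resulting volumes are outside the scope of this snippet.

```python
import numpy as np

def bit_divide(volume, lsb_bits=8):
    vol = volume.astype(np.uint16)
    msbv = vol >> lsb_bits                 # Most Significant Bit-Volume: coarse structure
    lsbv = vol & ((1 << lsb_bits) - 1)     # Least Significant Bit-Volume: fine texture
    return msbv, lsbv

def bit_merge(msbv, lsbv, lsb_bits=8):
    return (msbv.astype(np.uint16) << lsb_bits) | lsbv.astype(np.uint16)

vol = np.random.randint(0, 2 ** 12, size=(4, 64, 64), dtype=np.uint16)  # e.g. a 12-bit CT slab
msbv, lsbv = bit_divide(vol)
assert np.array_equal(bit_merge(msbv, lsbv), vol)                        # lossless round trip
```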
☆ PGDiffSeg: Prior-Guided Denoising Diffusion Model with Parameter-Shared Attention for Breast Cancer Segmentation
Early detection through imaging and accurate diagnosis is crucial in
mitigating the high mortality rate associated with breast cancer. However,
locating tumors from low-resolution and high-noise medical images is extremely
challenging. Therefore, this paper proposes a novel PGDiffSeg (Prior-Guided
Diffusion Denoising Model with Parameter-Shared Attention) that applies
diffusion denoising methods to breast cancer medical image segmentation,
accurately recovering the affected areas from Gaussian noise. Firstly, we
design a parallel pipeline for noise processing and semantic information
processing and propose a multi-layer parameter-shared attention (PSA) module
that seamlessly integrates these two pipelines. This integration empowers
PGDiffSeg to incorporate semantic details at multiple levels during the
denoising process, producing highly accurate segmentation maps. Secondly, we
introduce a guided strategy that leverages prior knowledge to simulate the
decision-making process of medical professionals, thereby enhancing the model's
ability to locate tumor positions precisely. Finally, we provide the first-ever
discussion on the interpretability of the generative diffusion model in the
context of breast cancer segmentation. Extensive experiments have demonstrated
the superiority of our model over the current state-of-the-art approaches,
confirming its effectiveness as a flexible diffusion denoising method suitable
for medical image research. Our code will be publicly available later.
☆ EntityCLIP: Entity-Centric Image-Text Matching via Multimodal Attentive Contrastive Learning
Recent advancements in image-text matching have been notable, yet prevailing
models predominantly cater to broad queries and struggle with accommodating
fine-grained query intention. In this paper, we work towards the
Entity-centric Image-Text Matching (EITM) task,
in which the text and image involve specific entity-related information. The
challenge of this task mainly lies in the larger semantic gap in entity
association modeling compared with the general image-text matching problem. To
narrow this large semantic gap between the entity-centric text and the images, we
take the foundational CLIP model as the backbone and devise a multimodal attentive
contrastive learning framework to tame CLIP to adapt to the EITM problem, developing a
model named EntityCLIP. The key of our multimodal attentive contrastive
learning is to generate interpretive explanation text using Large Language
Models (LLMs) as bridging clues. Specifically, we proceed by extracting
explanatory text from off-the-shelf LLMs. This explanation text, coupled with
the image and text, is then input into our specially crafted Multimodal
Attentive Experts (MMAE) module, which effectively integrates explanation texts
to narrow the gap of the entity-related text and image in a shared semantic
space. Building on the enriched features derived from MMAE, we further design
an effective Gated Integrative Image-text Matching (GI-ITM) strategy. The
GI-ITM employs an adaptive gating mechanism to aggregate MMAE's features,
subsequently applying image-text matching constraints to steer the alignment
between the text and the image. Extensive experiments are conducted on three
social media news benchmarks, including N24News, VisualNews, and GoodNews; the
results show that our method surpasses competing methods by a clear
margin.
☆ An Intelligent Agentic System for Complex Image Restoration Problems
Real-world image restoration (IR) is inherently complex and often requires
combining multiple specialized models to address diverse degradations. Inspired
by human problem-solving, we propose AgenticIR, an agentic system that mimics
the human approach to image processing by following five key stages:
Perception, Scheduling, Execution, Reflection, and Rescheduling. AgenticIR
leverages large language models (LLMs) and vision-language models (VLMs) that
interact via text generation to dynamically operate a toolbox of IR models. We
fine-tune VLMs for image quality analysis and employ LLMs for reasoning,
guiding the system step by step. To compensate for LLMs' lack of specific IR
knowledge and experience, we introduce a self-exploration method, allowing the
LLM to observe and summarize restoration results into referenceable documents.
Experiments demonstrate AgenticIR's potential in handling complex IR tasks,
representing a promising path toward achieving general intelligence in visual
processing.
☆ GenUDC: High Quality 3D Mesh Generation with Unsigned Dual Contouring Representation
Generating high-quality meshes with complex structures and realistic surfaces
is the primary goal of 3D generative models. Existing methods typically employ
sequence data or deformable tetrahedral grids for mesh generation. However,
sequence-based methods have difficulty producing complex structures with many
faces due to memory limits. The deformable tetrahedral grid-based method
MeshDiffusion fails to recover realistic surfaces due to the inherent ambiguity
in deformable grids. We propose the GenUDC framework to address these
challenges by leveraging the Unsigned Dual Contouring (UDC) as the mesh
representation. UDC discretizes a mesh in a regular grid and divides it into
the face and vertex parts, recovering both complex structures and fine details.
As a result, the one-to-one mapping between UDC and mesh resolves the ambiguity
problem. In addition, GenUDC adopts a two-stage, coarse-to-fine generative
process for 3D mesh generation. It first generates the face part as a rough
shape and then the vertex part to craft a detailed shape. Extensive evaluations
demonstrate the superiority of UDC as a mesh representation and the favorable
performance of GenUDC in mesh generation. The code and trained models are
available at https://github.com/TrepangCat/GenUDC.
comment: ACMMM 2024, code:https://github.com/TrepangCat/GenUDC
☆ TranSPORTmer: A Holistic Approach to Trajectory Understanding in Multi-Agent Sports ACCV 2024
Understanding trajectories in multi-agent scenarios requires addressing
various tasks, including predicting future movements, imputing missing
observations, inferring the status of unseen agents, and classifying different
global states. Traditional data-driven approaches often handle these tasks
separately with specialized models. We introduce TranSPORTmer, a unified
transformer-based framework capable of addressing all these tasks, showcasing
its application to the intricate dynamics of multi-agent sports scenarios like
soccer and basketball. Using Set Attention Blocks, TranSPORTmer effectively
captures temporal dynamics and social interactions in an equivariant manner.
The model's tasks are guided by an input mask that conceals missing or
yet-to-be-predicted observations. Additionally, we introduce a CLS extra agent
to classify states along soccer trajectories, including passes, possessions,
uncontrolled states, and out-of-play intervals, contributing to an enhancement
in modeling trajectories. Evaluations on soccer and basketball datasets show
that TranSPORTmer outperforms state-of-the-art task-specific models in player
forecasting, player forecasting-imputation, ball inference, and ball
imputation. https://youtu.be/8VtSRm8oGoE
comment: Accepted to ACCV 2024
☆ ADEM-VL: Adaptive and Embedded Fusion for Efficient Vision-Language Tuning
Recent advancements in multimodal fusion have witnessed the remarkable
success of vision-language (VL) models, which excel in various multimodal
applications such as image captioning and visual question answering. However,
building VL models requires substantial hardware resources, where efficiency is
restricted by two key factors: the extended input sequence of the language
model with vision features demands more computational operations, and a large
number of additional learnable parameters increase memory complexity. These
challenges significantly restrict the broader applicability of such models. To
bridge this gap, we propose ADEM-VL, an efficient vision-language method that
tunes VL models based on pretrained large language models (LLMs) by adopting a
parameter-free cross-attention mechanism for similarity measurements in
multimodal fusion. This approach only requires embedding vision features into
the language space, significantly reducing the number of trainable parameters
and accelerating both training and inference speeds. To enhance representation
learning in the fusion module, we introduce an efficient multiscale feature
generation scheme that requires only a single forward pass through the vision
encoder. Moreover, we propose an adaptive fusion scheme that dynamically
discards less relevant visual information for each text token based on its
attention score. This ensures that the fusion process prioritizes the most
pertinent visual features. With experiments on various tasks including visual
question answering, image captioning, and instruction-following, we demonstrate
that our framework outperforms existing approaches. Specifically, our method
surpasses existing methods by an average accuracy of 0.77% on ScienceQA
dataset, with reduced training and inference latency, demonstrating the
superiority of our framework. The code is available at
https://github.com/Hao840/ADEM-VL.
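A minimal sketch of a parameter-free cross-attention of the kind described above: similarities are plain scaled dot products between already-embedded text tokens and vision features, with no learnable projection matrices. The residual fusion and toy dimensions are assumptions, and ADEM-VL's multiscale feature generation and adaptive dropping are not reproduced.

```python
import torch

def parameter_free_cross_attention(text_tokens, vision_feats):
    # text_tokens: (B, N_t, D); vision_feats: (B, N_v, D), already embedded into the language space.
    d = text_tokens.shape[-1]
    attn = torch.softmax(text_tokens @ vision_feats.transpose(-2, -1) / d ** 0.5, dim=-1)
    return text_tokens + attn @ vision_feats   # fuse attended visual context, no trainable weights

txt, vis = torch.randn(2, 16, 64), torch.randn(2, 49, 64)
print(parameter_free_cross_attention(txt, vis).shape)  # torch.Size([2, 16, 64])
```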
☆ Quasi-Medial Distance Field (Q-MDF): A Robust Method for Approximating and Discretizing Neural Medial Axis
The medial axis, a lower-dimensional shape descriptor, plays an important
role in the field of digital geometry processing. Despite its importance,
robust computation of the medial axis transform from diverse inputs, especially
point clouds with defects, remains a significant challenge. In this paper, we
tackle the challenge by proposing a new implicit method that diverges from
mainstream explicit medial axis computation techniques. Our key technical
insight is that the difference between the signed distance field (SDF) and the
medial field (MF) of a solid shape is the unsigned distance field (UDF) of the
shape's medial axis. This allows for formulating medial axis computation as an
implicit reconstruction problem. Utilizing a modified double covering method,
we extract the medial axis as the zero level-set of the UDF. Extensive
experiments show that our method has enhanced accuracy and robustness in
learning compact medial axis transform from thorny meshes and point clouds
compared to existing methods.
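Written out, the key observation above reads $\mathrm{SDF}(x) - \mathrm{MF}(x) = \mathrm{UDF}_{\mathrm{MA}}(x)$, where $\mathrm{MA}$ denotes the shape's medial axis (sign conventions follow the paper); the medial axis is then extracted as the zero level-set of this UDF via the modified double covering method.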
☆ Scaling Robot Policy Learning via Zero-Shot Labeling with Foundation Models
Nils Blank, Moritz Reuss, Marcel Rühle, Ömer Erdinç Yağmurlu, Fabian Wenzel, Oier Mees, Rudolf Lioutikov
A central challenge towards developing robots that can relate human language
to their perception and actions is the scarcity of natural language annotations
in diverse robot datasets. Moreover, robot policies that follow natural
language instructions are typically trained on either templated language or
expensive human-labeled instructions, hindering their scalability. To this end,
we introduce NILS: Natural language Instruction Labeling for Scalability. NILS
automatically labels uncurated, long-horizon robot data at scale in a zero-shot
manner without any human intervention. NILS combines pretrained vision-language
foundation models in order to detect objects in a scene, detect object-centric
changes, segment tasks from large datasets of unlabelled interaction data and
ultimately label behavior datasets. Evaluations on BridgeV2, Fractal, and a
kitchen play dataset show that NILS can autonomously annotate diverse robot
demonstrations of unlabeled and unstructured datasets while alleviating several
shortcomings of crowdsourced human annotations, such as low data quality and
diversity. We use NILS to label over 115k trajectories obtained from over 430
hours of robot data. We open-source our auto-labeling code and generated
annotations on our website: http://robottasklabeling.github.io.
comment: Project Website at https://robottasklabeling.github.io/
☆ AdaDiffSR: Adaptive Region-aware Dynamic Acceleration Diffusion Model for Real-World Image Super-Resolution ECCV2024
Diffusion models (DMs) have shown promising results on single-image
super-resolution and other image-to-image translation tasks. Benefiting from
more computational resources and longer inference times, they are able to yield
more realistic images. Existing DMs-based super-resolution methods try to
achieve an overall average recovery over all regions via iterative refinement,
ignoring the consideration that different input image regions require different
timesteps to reconstruct. In this work, we notice that previous DMs-based
super-resolution methods waste computational resources on reconstructing
invisible details. To further improve the utilization of
computational resources, we propose AdaDiffSR, a DMs-based SR pipeline with
dynamic timesteps sampling strategy (DTSS). Specifically, by introducing the
multi-metrics latent entropy module (MMLE), we can achieve dynamic perception
of the latent spatial information gain during the denoising process, thereby
guiding the dynamic selection of the timesteps. In addition, we adopt a
progressive feature injection module (PFJ), which dynamically injects the
original image features into the denoising process based on the current
information gain, so as to generate images with both fidelity and realism.
Experiments show that our AdaDiffSR achieves performance comparable to
current state-of-the-art DMs-based SR methods while consuming less
computational resources and inference time on both synthetic and real-world
datasets.
comment: 18 pages, 6 figures, ECCV2024 accepted
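As a rough illustration of AdaDiffSR's dynamic timestep sampling idea, the
sketch below stops the reverse diffusion loop once a simple entropy-based
proxy for information gain falls below a threshold. The entropy proxy, the
diffusers-style scheduler interface, and all thresholds are assumptions for
illustration, not the paper's MMLE module.
```python
import torch

def entropy_proxy(latent: torch.Tensor) -> float:
    # Gaussian differential entropy of the latent, a crude stand-in for the
    # paper's multi-metrics latent entropy (illustrative only).
    var = latent.float().var(unbiased=False)
    return (0.5 * torch.log(2 * torch.pi * torch.e * (var + 1e-8))).item()

def adaptive_denoise(model, scheduler, latent, max_steps=50, min_gain=1e-3):
    """Run the reverse process and stop early once the per-step entropy gain
    drops below `min_gain` (a toy version of dynamic timestep sampling).
    A diffusers-style scheduler and an epsilon-prediction model are assumed."""
    prev = entropy_proxy(latent)
    for t in scheduler.timesteps[:max_steps]:
        noise_pred = model(latent, t)
        latent = scheduler.step(noise_pred, t, latent).prev_sample
        cur = entropy_proxy(latent)
        if abs(cur - prev) < min_gain:  # little new information: skip remaining steps
            break
        prev = cur
    return latent
```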
☆ VISAGE: Video Synthesis using Action Graphs for Surgery MICCAI 2024
Yousef Yeganeh, Rachmadio Lazuardi, Amir Shamseddin, Emine Dari, Yash Thirani, Nassir Navab, Azade Farshad
Surgical data science (SDS) is a field that analyzes patient data before,
during, and after surgery to improve surgical outcomes and skills. However,
surgical data is scarce, heterogeneous, and complex, which limits the
applicability of existing machine learning methods. In this work, we introduce
the novel task of future video generation in laparoscopic surgery. This task
can augment and enrich the existing surgical data and enable various
applications, such as simulation, analysis, and robot-aided surgery.
Ultimately, it involves not only understanding the current state of the
operation but also accurately predicting the dynamic and often unpredictable
nature of surgical procedures. Our proposed method, VISAGE (VIdeo Synthesis
using Action Graphs for Surgery), leverages the power of action scene graphs to
capture the sequential nature of laparoscopic procedures and utilizes diffusion
models to synthesize temporally coherent video sequences. VISAGE predicts the
future frames given only a single initial frame, and the action graph triplets.
By incorporating domain-specific knowledge through the action graph, VISAGE
ensures the generated videos adhere to the expected visual and motion patterns
observed in real laparoscopic procedures. The results of our experiments
demonstrate high-fidelity video generation for laparoscopy procedures, which
enables various applications in SDS.
comment: Accepted at MICCAI 2024 Embodied AI and Robotics for HealTHcare
(EARTH) Workshop
☆ Efficient Neural Implicit Representation for 3D Human Reconstruction
High-fidelity digital human representations are increasingly in demand in the
digital world, particularly for interactive telepresence, AR/VR, 3D graphics,
and the rapidly evolving metaverse. Even though they work well in small spaces,
conventional methods for reconstructing 3D human motion frequently require the
use of expensive hardware and have high processing costs. This study presents
HumanAvatar, an innovative approach that efficiently reconstructs precise human
avatars from monocular video sources. At the core of our methodology, we
integrate the pre-trained HuMoR, a model celebrated for its proficiency in
human motion estimation. This is adeptly fused with the cutting-edge neural
radiance field technology, Instant-NGP, and the state-of-the-art articulated
model, Fast-SNARF, to enhance the reconstruction fidelity and speed. By
combining these technologies, we create a system that renders quickly and
effectively while estimating human pose parameters with unmatched accuracy.
We have enhanced our system with an advanced
posture-sensitive space reduction technique, which optimally balances rendering
quality with computational efficiency. In our detailed experimental analysis
using both artificial and real-world monocular videos, we establish the
advanced performance of our approach. HumanAvatar consistently equals or
surpasses contemporary leading-edge reconstruction techniques in quality.
Furthermore, it achieves these complex reconstructions in minutes, a fraction
of the time typically required by existing methods. Our models achieve a
training speed that is 110x faster than that of state-of-the-art (SoTA)
NeRF-based models. Our technique performs noticeably better than SoTA dynamic
human NeRF methods under an identical runtime limit. HumanAvatar can provide
effective visuals after only 30 seconds of training.
☆ Emotion Recognition with Facial Attention and Objective Activation Functions
In this paper, we study the effect of introducing channel and spatial
attention mechanisms, namely SEN-Net, ECA-Net, and CBAM, to existing CNN
vision-based models such as VGGNet, ResNet, and ResNetV2 to perform the Facial
Emotion Recognition task. We show not only that attention can significantly
improve the performance of these models, but also that combining it with
different activation functions can further increase their performance.
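For readers unfamiliar with the channel-attention modules referenced above, a
standard Squeeze-and-Excitation block (the SEN-Net-style component) can be
inserted after a VGG/ResNet stage as sketched below; the reduction ratio of 16
is the common default, not necessarily this paper's setting.
```python
import torch
import torch.nn as nn

class SEBlock(nn.Module):
    """Squeeze-and-Excitation channel attention: squeeze spatial dims,
    learn per-channel weights, and rescale the feature map."""
    def __init__(self, channels: int, reduction: int = 16):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(self.pool(x).view(b, c)).view(b, c, 1, 1)
        return x * w  # reweight channels before the next stage
```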
☆ New Insight in Cervical Cancer Diagnosis Using Convolution Neural Network Architecture
The Pap smear is a screening method for early cervical cancer diagnosis. The
selection of the right optimizer in the convolutional neural network (CNN)
model is key to the success of the CNN in image classification, including the
classification of cervical cancer Pap smear images. In this study, stochastic
gradient descent (SGD), RMSprop, Adam, AdaGrad, AdaDelta, Adamax, and Nadam
optimizers were used to classify cervical cancer Pap smear images from the
SipakMed dataset. Resnet-18, Resnet-34, and VGG-16 are the CNN architectures
used in this study, and each architecture uses a transfer-learning model. Based
on the test results, we conclude that the transfer learning model performs
better on all CNNs and optimization techniques and that in the transfer
learning model, the optimization has little influence on the training of the
model. Adamax, with accuracy values of 72.8% and 66.8%, had the best accuracy
for the VGG-16 and Resnet-18 architectures, respectively. On Resnet-34, Adamax
reached 54.0%, 0.034% lower than Nadam. Overall, Adamax is a suitable optimizer for
CNN in cervical cancer classification on Resnet-18, Resnet-34, and VGG-16
architectures. This study provides new insights into the configuration of CNN
models for Pap smear image analysis.
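A minimal transfer-learning setup in the spirit of the comparison above: an
ImageNet-pretrained ResNet-18 with a new classification head and several of
the candidate optimizers. The five-class head (assumed to match SipakMed's
cell categories) and all learning rates are illustrative choices, not the
paper's exact configuration.
```python
import torch
import torchvision

def build_finetune_model(num_classes: int = 5):
    """ImageNet-pretrained ResNet-18 with a fresh classification head."""
    model = torchvision.models.resnet18(weights="IMAGENET1K_V1")
    model.fc = torch.nn.Linear(model.fc.in_features, num_classes)
    return model

model = build_finetune_model()
# A subset of the optimizers compared in the study; hyperparameters are assumed.
optimizers = {
    "SGD": torch.optim.SGD(model.parameters(), lr=1e-3, momentum=0.9),
    "Adam": torch.optim.Adam(model.parameters(), lr=1e-4),
    "Adamax": torch.optim.Adamax(model.parameters(), lr=1e-4),
    "Nadam": torch.optim.NAdam(model.parameters(), lr=1e-4),
}
```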
☆ YOLO-Vehicle-Pro: A Cloud-Edge Collaborative Framework for Object Detection in Autonomous Driving under Adverse Weather Conditions
With the rapid advancement of autonomous driving technology, efficient and
accurate object detection capabilities have become crucial factors in ensuring
the safety and reliability of autonomous driving systems. However, in
low-visibility environments such as hazy conditions, the performance of
traditional object detection algorithms often degrades significantly, failing
to meet the demands of autonomous driving. To address this challenge, this
paper proposes two innovative deep learning models: YOLO-Vehicle and
YOLO-Vehicle-Pro. YOLO-Vehicle is an object detection model tailored
specifically for autonomous driving scenarios, employing multimodal fusion
techniques to combine image and textual information for object detection.
YOLO-Vehicle-Pro builds upon this foundation by introducing an improved image
dehazing algorithm, enhancing detection performance in low-visibility
environments. In addition to model innovation, this paper also designs and
implements a cloud-edge collaborative object detection system, deploying models
on edge devices and offloading partial computational tasks to the cloud in
complex situations. Experimental results demonstrate that on the KITTI dataset,
the YOLO-Vehicle-v1s model achieved 92.1% accuracy while maintaining a
detection speed of 226 FPS and an inference time of 12ms, meeting the real-time
requirements of autonomous driving. When processing hazy images, the
YOLO-Vehicle-Pro model achieved a high accuracy of 82.3% mAP@50 on the Foggy
Cityscapes dataset while maintaining a detection speed of 43 FPS.
☆ YOLOv11: An Overview of the Key Architectural Enhancements
This study presents an architectural analysis of YOLOv11, the latest
iteration in the YOLO (You Only Look Once) series of object detection models.
We examine the model's architectural innovations, including the introduction of
the C3k2 (Cross Stage Partial with kernel size 2) block, SPPF (Spatial Pyramid
Pooling - Fast), and C2PSA (Convolutional block with Parallel Spatial
Attention) components, which contribute to improving the model's performance in
several ways, such as enhanced feature extraction. The paper explores YOLOv11's
expanded capabilities across various computer vision tasks, including object
detection, instance segmentation, pose estimation, and oriented object
detection (OBB). We review the model's performance improvements in terms of
mean Average Precision (mAP) and computational efficiency compared to its
predecessors, with a focus on the trade-off between parameter count and
accuracy. Additionally, the study discusses YOLOv11's versatility across
different model sizes, from nano to extra-large, catering to diverse
application needs from edge devices to high-performance computing environments.
Our research provides insights into YOLOv11's position within the broader
landscape of object detection and its potential impact on real-time computer
vision applications.
☆ Continual Learning on a Data Diet
Continual Learning (CL) methods usually learn from all available data.
However, this is not the case in human cognition, which efficiently focuses on
key experiences while disregarding redundant information. Similarly, not
all data points in a dataset have equal potential; some can be more informative
than others. This disparity may significantly impact the performance, as both
the quality and quantity of samples directly influence the model's
generalizability and efficiency. Drawing inspiration from this, we explore the
potential of learning from important samples and present an empirical study for
evaluating coreset selection techniques in the context of CL to stimulate
research in this unexplored area. We train different continual learners on
increasing amounts of selected samples and investigate the learning-forgetting
dynamics by shedding light on the underlying mechanisms driving their improved
stability-plasticity balance. We present several significant observations:
learning from selectively chosen samples (i) enhances incremental accuracy,
(ii) improves knowledge retention of previous tasks, and (iii) refines learned
representations. This analysis contributes to a deeper understanding of
selective learning strategies in CL scenarios.
comment: 18 pages, 6 figures
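One simple instance of the coreset selection techniques surveyed above: score
each sample of the current task by its loss under the current model and keep
the hardest ones. The loss-based scoring rule and the budget are illustrative
choices, not a specific method from the study.
```python
import torch

def select_coreset(model, loader, budget, device="cpu"):
    """Rank current-task samples by per-example loss and keep the `budget`
    hardest ones (an illustrative coreset selection rule)."""
    model.eval()
    criterion = torch.nn.CrossEntropyLoss(reduction="none")
    scores, xs, ys = [], [], []
    with torch.no_grad():
        for x, y in loader:
            loss = criterion(model(x.to(device)), y.to(device))
            scores.append(loss.cpu()); xs.append(x); ys.append(y)
    scores = torch.cat(scores)
    idx = scores.argsort(descending=True)[:budget]
    return torch.cat(xs)[idx], torch.cat(ys)[idx]
```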
☆ Longitudinal Causal Image Synthesis
Clinical decision-making relies heavily on causal reasoning and longitudinal
analysis. For example, for a patient with Alzheimer's disease (AD), how will
the brain grey matter atrophy in a year if intervened on the A-beta level in
cerebrospinal fluid? The answer is fundamental to diagnosis and follow-up
treatment. However, such queries involve counterfactual medical images, which
cannot be acquired by instrumental means or by correlation-based image
synthesis models. Hence, a causal longitudinal image
synthesis (CLIS) method, enabling the synthesis of such images, is highly
valuable. However, building a CLIS model confronts three primary yet unmet
challenges: mismatched dimensionality between high-dimensional images and
low-dimensional tabular variables, inconsistent collection intervals of
follow-up data, and inadequate causal modeling capability of existing causal
graph methods for image data. In this paper, we establish a tabular-visual
causal graph (TVCG) for CLIS, overcoming these challenges through a novel
integration of generative imaging, continuous-time modeling, and structural
causal models combined with a neural network. We train our CLIS model on the
ADNI dataset and evaluate it on two other AD datasets, which illustrates the
outstanding yet controllable quality of the synthesized images and the
contribution of synthesized MRI to characterizing AD progression,
substantiating the method's reliability and utility in clinical settings.
☆ Deep Generative Models for 3D Medical Image Synthesis
Deep generative modeling has emerged as a powerful tool for synthesizing
realistic medical images, driving advances in medical image analysis, disease
diagnosis, and treatment planning. This chapter explores various deep
generative models for 3D medical image synthesis, with a focus on Variational
Autoencoders (VAEs), Generative Adversarial Networks (GANs), and Denoising
Diffusion Models (DDMs). We discuss the fundamental principles, recent
advances, as well as strengths and weaknesses of these models and examine their
applications in clinically relevant problems, including unconditional and
conditional generation tasks like image-to-image translation and image
reconstruction. We additionally review commonly used evaluation metrics for
assessing image fidelity, diversity, utility, and privacy and provide an
overview of current challenges in the field.
☆ Surgical Scene Segmentation by Transformer With Asymmetric Feature Enhancement
Surgical scene segmentation is a fundamental task for robotic-assisted
laparoscopic surgery understanding. It often contains various anatomical
structures and surgical instruments, where similar local textures and
fine-grained structures make the segmentation a difficult task. Vision-specific
transformer methods are a promising way for surgical scene understanding.
However, there are still two main challenges. Firstly, the absence of
inner-patch information fusion leads to poor segmentation performance.
Secondly, the specific characteristics of anatomy and instruments are not
specifically modeled. To tackle the above challenges, we propose a novel
Transformer-based framework with an Asymmetric Feature Enhancement module
(TAFE), which enhances local information and then actively fuses the improved
feature pyramid into the embeddings from transformer encoders by a multi-scale
interaction attention strategy. The proposed method outperforms the SOTA
methods in several different surgical segmentation tasks and additionally
demonstrates its ability to recognize fine-grained structures. Code is available at
https://github.com/cyuan-sjtu/ViT-asym.
☆ MIA-DPO: Multi-Image Augmented Direct Preference Optimization For Large Vision-Language Models
Ziyu Liu, Yuhang Zang, Xiaoyi Dong, Pan Zhang, Yuhang Cao, Haodong Duan, Conghui He, Yuanjun Xiong, Dahua Lin, Jiaqi Wang
Visual preference alignment involves training Large Vision-Language Models
(LVLMs) to predict human preferences between visual inputs. This is typically
achieved by using labeled datasets of chosen/rejected pairs and employing
optimization algorithms like direct preference optimization (DPO). Existing
visual alignment methods, primarily designed for single-image scenarios,
struggle to effectively handle the complexity of multi-image tasks due to the
scarcity of diverse training data and the high cost of annotating
chosen/rejected pairs. We present Multi-Image Augmented Direct Preference
Optimization (MIA-DPO), a visual preference alignment approach that effectively
handles multi-image inputs. MIA-DPO mitigates the scarcity of diverse
multi-image training data by extending single-image data with unrelated images
arranged in grid collages or pic-in-pic formats, significantly reducing the
costs associated with multi-image data annotations. Our observation reveals
that attention values of LVLMs vary considerably across different images. We
use attention values to identify and filter out rejected responses the model
may have mistakenly focused on. Our attention-aware selection constructs the
chosen/rejected pairs without relying on (i) human annotation, (ii) extra
data, or (iii) external models or APIs. MIA-DPO is compatible with various
architectures and outperforms existing methods on five multi-image benchmarks,
achieving an average performance boost of 3.0% on LLaVA-v1.5 and 4.3% on the
recent InternLM-XC2.5. Moreover, MIA-DPO has a minimal effect on the model's
ability to understand single images.
comment: Project URL: https://github.com/Liuziyu77/MIA-DPO
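A minimal sketch of the grid-collage augmentation idea used by MIA-DPO: extend
a single-image sample by pasting unrelated images into a 2x2 grid. The layout,
cell size, and helper name are assumptions for illustration, not the exact
recipe from the paper.
```python
from PIL import Image

def grid_collage(target: Image.Image, distractors: list, cell=(336, 336)) -> Image.Image:
    """Assemble a 2x2 collage: the original image plus three unrelated images,
    turning a single-image sample into a multi-image one."""
    assert len(distractors) >= 3
    w, h = cell
    canvas = Image.new("RGB", (2 * w, 2 * h))
    tiles = [target.resize(cell)] + [d.resize(cell) for d in distractors[:3]]
    for i, tile in enumerate(tiles):
        canvas.paste(tile, ((i % 2) * w, (i // 2) * h))
    return canvas
```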
☆ Bridging the Gaps: Utilizing Unlabeled Face Recognition Datasets to Boost Semi-Supervised Facial Expression Recognition
In recent years, Facial Expression Recognition (FER) has gained increasing
attention. Most current work focuses on supervised learning, which requires a
large amount of labeled and diverse images, while FER suffers from the scarcity
of large, diverse datasets and annotation difficulty. To address these
problems, we focus on utilizing large unlabeled Face Recognition (FR) datasets
to boost semi-supervised FER. Specifically, we first perform face
reconstruction pre-training on large-scale facial images without annotations to
learn features of facial geometry and expression regions, followed by two-stage
fine-tuning on FER datasets with limited labels. In addition, to further
alleviate the scarcity of labeled and diverse images, we propose a Mixup-based
data augmentation strategy tailored for facial images, and the loss weights of
real and virtual images are determined according to the intersection-over-union
(IoU) of the faces in the two images. Experiments on RAF-DB, AffectNet, and
FERPlus show that our method outperforms existing semi-supervised FER methods
and achieves new state-of-the-art performance. Remarkably, with only 5% and 25%
of the training sets, our method achieves 64.02% on AffectNet and 88.23% on
RAF-DB, which is comparable to fully supervised state-of-the-art methods. Codes will be
made publicly available at https://github.com/zhelishisongjie/SSFER.
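A rough sketch of the face-oriented Mixup idea described above: blend two face
images and derive the virtual sample's loss weight from the IoU of the two
face boxes. The exact weighting scheme here is an assumption; the paper's
formulation may differ.
```python
import torch

def face_iou(box_a, box_b):
    """IoU of two face boxes given as (x1, y1, x2, y2)."""
    xa, ya = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    xb, yb = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0, xb - xa) * max(0, yb - ya)
    area = lambda b: (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area(box_a) + area(box_b) - inter + 1e-8)

def mixup_with_iou_weight(img_a, img_b, box_a, box_b, lam=0.5):
    """Blend two face images; weight the virtual sample's loss by the face IoU
    (high overlap -> trust the mixed label more). Illustrative only."""
    mixed = lam * img_a + (1 - lam) * img_b
    return mixed, face_iou(box_a, box_b)
```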
☆ ImDy: Human Inverse Dynamics from Imitated Observations
Inverse dynamics (ID), which aims at reproducing the driven torques from
human kinematic observations, has been a critical tool for gait analysis.
However, it is hindered from wider application to general motion due to its
limited scalability. Conventional optimization-based ID requires expensive
laboratory setups, restricting its availability. To alleviate this problem, we
propose to exploit recent advances in human motion imitation algorithms
to learn human inverse dynamics in a data-driven manner. The key insight is
that the human ID knowledge is implicitly possessed by motion imitators, though
not directly applicable. In light of this, we devise an efficient data
collection pipeline with state-of-the-art motion imitation algorithms and
physics simulators, resulting in a large-scale human inverse dynamics benchmark
named Imitated Dynamics (ImDy). ImDy contains over 150 hours of motion with joint
torque and full-body ground reaction force data. With ImDy, we train a
data-driven human inverse dynamics solver ImDyS(olver) in a fully supervised
manner, which conducts ID and ground reaction force estimation simultaneously.
Experiments on ImDy and real-world data demonstrate the impressive competency
of ImDyS in human inverse dynamics and ground reaction force estimation.
Moreover, the potential of ImDy(-S) as a fundamental motion analysis tool is
exhibited with downstream applications. The project page is
https://foruck.github.io/ImDy/.
comment: Yong-Lu Li and Cewu Lu are the corresponding authors
☆ Towards Effective Data-Free Knowledge Distillation via Diverse Diffusion Augmentation
Data-free knowledge distillation (DFKD) has emerged as a pivotal technique in
the domain of model compression, substantially reducing the dependency on the
original training data. Nonetheless, conventional DFKD methods that employ
synthesized training data are prone to the limitations of inadequate diversity
and discrepancies in distribution between the synthesized and original
datasets. To address these challenges, this paper introduces an innovative
approach to DFKD through diverse diffusion augmentation (DDA). Specifically, we
revise the common data synthesis paradigm in DFKD into a composite process
by applying diffusion models after data synthesis as a form of
self-supervised augmentation, which generates a spectrum of data samples with
similar distributions while retaining controlled variations. Furthermore, to
mitigate excessive deviation in the embedding space, we introduce an image
filtering technique grounded in cosine similarity to maintain fidelity during
the knowledge distillation process. Comprehensive experiments conducted on
CIFAR-10, CIFAR-100, and Tiny-ImageNet datasets showcase the superior
performance of our method across various teacher-student network
configurations, outperforming the contemporary state-of-the-art DFKD methods.
Code will be available at: https://github.com/SLGSP/DDA.
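A minimal sketch of the cosine-similarity image filtering step mentioned
above: keep only diffusion-augmented samples whose embeddings remain close to
the original synthetic sample. The 0.7 threshold is an assumed value.
```python
import torch
import torch.nn.functional as F

def filter_by_cosine(aug_embeds: torch.Tensor, anchor: torch.Tensor, min_sim: float = 0.7):
    """Return a keep-mask over augmented samples based on cosine similarity to
    the original synthetic sample's embedding (illustrative filtering rule)."""
    sims = F.cosine_similarity(aug_embeds, anchor.unsqueeze(0), dim=-1)
    return sims >= min_sim, sims
```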
☆ PlantCamo: Plant Camouflage Detection
Camouflaged Object Detection (COD) aims to detect objects with camouflaged
properties. Although previous studies have focused on natural (animals and
insects) and unnatural (artistic and synthetic) camouflage detection, plant
camouflage has been neglected. However, plant camouflage plays a vital role in
natural camouflage. Therefore, this paper introduces a new challenging problem
of Plant Camouflage Detection (PCD). To address this problem, we introduce the
PlantCamo dataset, which comprises 1,250 images with camouflaged plants
representing 58 object categories in various natural scenes. To investigate the
current status of plant camouflage detection, we conduct a large-scale
benchmark study using 20+ cutting-edge COD models on the proposed dataset. Due
to the unique characteristics of plant camouflage, including holes and
irregular borders, we developed a new framework, named PCNet, dedicated to PCD.
Our PCNet surpasses the performance of existing COD models thanks to its multi-scale global feature
enhancement and refinement. Finally, we discuss the potential applications and
insights, hoping this work fills the gap in fine-grained COD research and
facilitates further intelligent ecology research. All resources will be
available on https://github.com/yjybuaa/PlantCamo.
☆ How to Continually Adapt Text-to-Image Diffusion Models for Flexible Customization? NeurIPS2024
Jiahua Dong, Wenqi Liang, Hongliu Li, Duzhen Zhang, Meng Cao, Henghui Ding, Salman Khan, Fahad Shahbaz Khan
Custom diffusion models (CDMs) have attracted widespread attention due to
their astonishing generative ability for personalized concepts. However, most
existing CDMs unreasonably assume that personalized concepts are fixed and
cannot change over time. Moreover, they heavily suffer from catastrophic
forgetting and concept neglect on old personalized concepts when continually
learning a series of new concepts. To address these challenges, we propose a
novel Concept-Incremental text-to-image Diffusion Model (CIDM), which can
resolve catastrophic forgetting and concept neglect to learn new customization
tasks in a concept-incremental manner. Specifically, to surmount the
catastrophic forgetting of old concepts, we develop a concept consolidation
loss and an elastic weight aggregation module. They can explore task-specific
and task-shared knowledge during training, and aggregate all low-rank weights
of old concepts based on their contributions during inference. Moreover, in
order to address concept neglect, we devise a context-controllable synthesis
strategy that leverages expressive region features and noise estimation to
control the contexts of generated images according to user conditions.
Experiments validate that our CIDM surpasses existing custom diffusion models.
The source codes are available at https://github.com/JiahuaDong/CIFC.
comment: Accepted to NeurIPS2024
☆ Double Banking on Knowledge: Customized Modulation and Prototypes for Multi-Modality Semi-supervised Medical Image Segmentation
Multi-modality (MM) semi-supervised learning (SSL) based medical image
segmentation has recently gained increasing attention for its ability to
utilize MM data and reduce reliance on labeled images. However, current methods
face several challenges: (1) Complex network designs hinder scalability to
scenarios with more than two modalities. (2) Focusing solely on
modality-invariant representations while neglecting modality-specific features
leads to incomplete MM learning. (3) Leveraging unlabeled data with generative
methods can be unreliable for SSL. To address these problems, we propose Double
Bank Dual Consistency (DBDC), a novel MM-SSL approach for medical image
segmentation. To address challenge (1), we propose a modality all-in-one
segmentation network that accommodates data from any number of modalities,
removing the limitation on modality count. To address challenge (2), we design
two learnable plug-in banks, a Modality-Level Modulation Bank (MLMB) and a
Modality-Level Prototype Bank (MLPB), to capture both modality-invariant and
modality-specific knowledge. These banks are updated using our proposed
Modality Prototype Contrastive Learning (MPCL). Additionally, we design
Modality Adaptive Weighting (MAW) to dynamically adjust learning weights for
each modality, ensuring balanced MM learning as different modalities learn at
different rates. Finally, to address challenge (3), we introduce a Dual
Consistency (DC) strategy that enforces consistency at both the image and
feature levels without relying on generative methods. We evaluate our method on
a 2-to-4 modality segmentation task using three open-source datasets, and
extensive experiments show that our method outperforms state-of-the-art
approaches.
☆ BlurryScope: a cost-effective and compact scanning microscope for automated HER2 scoring using deep learning on blurry image data
We developed a rapid scanning optical microscope, termed "BlurryScope", that
leverages continuous image acquisition and deep learning to provide a
cost-effective and compact solution for automated inspection and analysis of
tissue sections. BlurryScope integrates specialized hardware with a neural
network-based model to quickly process motion-blurred histological images and
perform automated pathology classification. This device offers comparable speed
to commercial digital pathology scanners, but at a significantly lower price
point and smaller size/weight, making it ideal for fast triaging in small
clinics, as well as for resource-limited settings. To demonstrate the
proof-of-concept of BlurryScope, we implemented automated classification of
human epidermal growth factor receptor 2 (HER2) scores on immunohistochemically
(IHC) stained breast tissue sections, achieving concordant results with those
obtained from a high-end digital scanning microscope. We evaluated this
approach by scanning HER2-stained tissue microarrays (TMAs) at a continuous
speed of 5 mm/s, which introduces bidirectional motion blur artifacts. These
compromised images were then used to train our network models. Using a test set
of 284 unique patient cores, we achieved blind testing accuracies of 79.3% and
89.7% for 4-class (0, 1+, 2+, 3+) and 2-class (0/1+, 2+/3+) HER2 score
classification, respectively. BlurryScope automates the entire workflow, from
image scanning to stitching and cropping of regions of interest, as well as
HER2 score classification. We believe BlurryScope has the potential to enhance
the current pathology infrastructure in resource-scarce environments, save
diagnostician time and bolster cancer identification and classification across
various clinical environments.
comment: 18 Pages, 6 Figures
☆ Unsupervised Low-dose CT Reconstruction with One-way Conditional Normalizing Flows
Deep-learning methods have shown promising performance for low-dose computed
tomography (LDCT) reconstruction. However, supervised methods face the problem
of lacking labeled data in clinical scenarios, and the CNN-based unsupervised
denoising methods would cause excessive smoothing in the reconstructed image.
Recently, the normalizing flows (NFs) based methods have shown advantages in
producing detail-rich images and avoiding over-smoothing; however, there are
still issues: (1) Although the alternating optimization in the data and latent
space can well utilize the regularization and generation capabilities of NFs,
the current two-way transformation strategy of noisy images and latent
variables would cause detail loss and secondary artifacts; and (2) Training NFs
on high-resolution CT images is hard due to the huge computational cost. Though using
conditional normalizing flows (CNFs) to learn conditional probability can
reduce the computational burden, current methods require labeled data for
conditionalization, and the unsupervised CNFs-based LDCT reconstruction remains
a problem. To tackle these problems, we propose a novel CNFs-based unsupervised
LDCT iterative reconstruction algorithm. It employs strict one-way
transformation when performing alternating optimization in the dual spaces,
thus effectively avoiding the problems of detail loss and secondary artifacts.
By proposing a novel unsupervised conditionalization strategy, we train CNFs on
high-resolution CT images, thus achieving fast and high-quality unsupervised
reconstruction. Experiments on different datasets suggest that the performance
of the proposed algorithm could surpass some state-of-the-art unsupervised and
even supervised methods.
☆ OVT-B: A New Large-Scale Benchmark for Open-Vocabulary Multi-Object Tracking NeurIPS 2024
Open-vocabulary object perception has become an important topic in artificial
intelligence, which aims to identify objects with novel classes that have not
been seen during training. Under this setting, open-vocabulary object detection
(OVD) in a single image has been studied extensively in the literature. However,
open-vocabulary object tracking (OVT) from a video has been studied less, and
one reason is the shortage of benchmarks. In this work, we have built a new
large-scale benchmark for open-vocabulary multi-object tracking, named OVT-B.
OVT-B contains 1,048 categories of objects and 1,973 videos with 637,608
bounding box annotations, which is much larger than the sole open-vocabulary
tracking dataset, i.e., OVTAO-val dataset (200+ categories, 900+ videos). The
proposed OVT-B can be used as a new benchmark to pave the way for OVT research.
We also develop a simple yet effective baseline method for OVT. It integrates
the motion features for object tracking, which is an important feature for MOT
but is ignored in previous OVT methods. Experimental results have verified the
usefulness of the proposed benchmark and the effectiveness of our method. We
have released the benchmark to the public at
https://github.com/Coo1Sea/OVT-B-Dataset.
comment: 15 pages, 6 figures, accepted at NeurIPS 2024 Dataset and Benchmark
Track
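The OVT-B baseline's use of motion features can be illustrated by blending an
appearance-similarity matrix with an IoU term computed against motion-predicted
track boxes, as in the sketch below; the fusion weight and box format are
assumptions, not the paper's exact design.
```python
import numpy as np

def box_iou_matrix(tracks: np.ndarray, dets: np.ndarray) -> np.ndarray:
    """Pairwise IoU between predicted track boxes and detections (x1,y1,x2,y2)."""
    iou = np.zeros((len(tracks), len(dets)))
    for i, t in enumerate(tracks):
        for j, d in enumerate(dets):
            x1, y1 = max(t[0], d[0]), max(t[1], d[1])
            x2, y2 = min(t[2], d[2]), min(t[3], d[3])
            inter = max(0, x2 - x1) * max(0, y2 - y1)
            union = (t[2]-t[0])*(t[3]-t[1]) + (d[2]-d[0])*(d[3]-d[1]) - inter
            iou[i, j] = inter / (union + 1e-8)
    return iou

def association_score(appearance_sim, predicted_track_boxes, det_boxes, w=0.5):
    """Blend appearance similarity with a motion cue (IoU against motion-
    predicted track boxes); the fusion weight is an assumed value."""
    return w * appearance_sim + (1 - w) * box_iou_matrix(predicted_track_boxes, det_boxes)
```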
☆ Diffusion Priors for Variational Likelihood Estimation and Image Denoising NeurIPS2024
Real-world noise removal is crucial in low-level computer vision. Due to the
remarkable generation capabilities of diffusion models, recent attention has
shifted towards leveraging diffusion priors for image restoration tasks.
However, existing diffusion priors-based methods either consider simple noise
types or rely on approximate posterior estimation, limiting their effectiveness
in addressing structured and signal-dependent noise commonly found in
real-world images. In this paper, we build upon diffusion priors and propose
adaptive likelihood estimation and MAP inference during the reverse diffusion
process to tackle real-world noise. We introduce an independent,
non-identically distributed likelihood combined with the noise precision
(inverse variance) prior and dynamically infer the precision posterior using
variational Bayes during the generation process. Meanwhile, we rectify the
estimated noise variance through local Gaussian convolution. The final denoised
image is obtained by propagating intermediate MAP solutions that balance the
updated likelihood and diffusion prior. Additionally, we explore the local
diffusion prior inherent in low-resolution diffusion models, enabling direct
handling of high-resolution noisy images. Extensive experiments and analyses on
diverse real-world datasets demonstrate the effectiveness of our method. Code
is available at https://github.com/HUST-Tan/DiffusionVI.
comment: Accepted by NeurIPS2024 as Spotlight
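The balance between the updated likelihood and the diffusion prior can be
illustrated by the closed-form Gaussian MAP step below, which precision-weights
the noisy observation against the prior's clean-image estimate. This is a
simplified stand-in for the paper's variational procedure, with the prior
weight as an assumed scalar.
```python
import torch

def map_combine(x0_prior: torch.Tensor, y_noisy: torch.Tensor,
                precision: torch.Tensor, prior_weight: float = 1.0) -> torch.Tensor:
    """MAP estimate for a Gaussian likelihood with per-pixel precision and a
    Gaussian prior centered at the diffusion estimate x0_prior: a
    precision-weighted average of the two (illustrative simplification)."""
    return (precision * y_noisy + prior_weight * x0_prior) / (precision + prior_weight)
```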
☆ PathMoCo: A Novel Framework to Improve Feature Embedding in Self-supervised Contrastive Learning for Histopathological Images
Self-supervised contrastive learning has become a cornerstone in various
areas, particularly histopathological image analysis. Image augmentation plays
a crucial role in self-supervised contrastive learning, as it generates
variations in image samples. However, traditional image augmentation techniques
often overlook the unique characteristics of histopathological images. In this
paper, we propose a new histopathology-specific image augmentation method
called stain reconstruction augmentation (SRA). We integrate our SRA with MoCo
v3, a leading model in self-supervised contrastive learning, along with our
additional contrastive loss terms, and call the new model PathMoCo. We
demonstrate that our PathMoCo always outperforms the standard MoCo v3 across
various downstream tasks and achieves comparable or superior performance to
other foundation models pre-trained on significantly larger histopathology
datasets.
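A generic stain-space augmentation in the spirit of SRA: deconvolve an H&E
image into stain channels with scikit-image, jitter each channel, and
reconstruct. This is an illustrative recipe, not the paper's exact stain
reconstruction augmentation.
```python
import numpy as np
from skimage.color import rgb2hed, hed2rgb

def stain_jitter(rgb: np.ndarray, sigma: float = 0.05) -> np.ndarray:
    """Decompose an RGB histology image into Hematoxylin/Eosin/DAB channels,
    rescale each stain slightly, and reconstruct (generic stain augmentation)."""
    hed = rgb2hed(rgb)                                  # stain deconvolution
    scale = 1.0 + np.random.uniform(-sigma, sigma, size=3)
    return np.clip(hed2rgb(hed * scale), 0.0, 1.0)      # back to RGB in [0, 1]
```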
☆ HCDN: A Change Detection Network for Construction Housekeeping Using Feature Fusion and Large Vision Models
Workplace safety has received increasing attention as millions of workers
worldwide suffer from work-related accidents. Although poor housekeeping is a
significant contributor to construction accidents, there remains a significant
lack of technological research focused on improving housekeeping practices in
construction sites. Recognizing and locating poor housekeeping in a dynamic
construction site is an important task that can be improved through computer
vision approaches. Despite advances in AI and computer vision, existing methods
for detecting poor housekeeping conditions face many challenges, including
limited explanations, an inability to localize poor housekeeping, and a lack of
annotated datasets. On the other hand, change detection, which aims to detect
changed environmental conditions (e.g., changing from good to poor
housekeeping) and 'where' the change has occurred (e.g., the location of objects
causing poor housekeeping), has not been explored for the problem of
housekeeping management. To address these challenges, we propose the
Housekeeping Change Detection Network (HCDN), an advanced change detection
neural network that integrates a feature fusion module and a large vision
model, achieving state-of-the-art performance. Additionally, we introduce the
approach to establish a novel change detection dataset (named Housekeeping-CCD)
focused on housekeeping in construction sites, along with a housekeeping
segmentation dataset. Our contributions include significant performance
improvements compared to existing methods, providing an effective tool for
enhancing construction housekeeping and safety. To promote further development,
we share our source code and trained models for global researchers:
https://github.com/NUS-DBE/Housekeeping-CD.
☆ PLGS: Robust Panoptic Lifting with 3D Gaussian Splatting
Previous methods utilize the Neural Radiance Field (NeRF) for panoptic
lifting, while their training and rendering speed are unsatisfactory. In
contrast, 3D Gaussian Splatting (3DGS) has emerged as a prominent technique due
to its rapid training and rendering speed. However, unlike NeRF, the
conventional 3DGS may not satisfy the basic smoothness assumption as it does
not rely on any parameterized structures to render (e.g., MLPs). Consequently,
the conventional 3DGS is, in nature, more susceptible to noisy 2D mask
supervision. In this paper, we propose a new method called PLGS that enables
3DGS to generate consistent panoptic segmentation masks from noisy 2D
segmentation masks while maintaining superior efficiency compared to NeRF-based
methods. Specifically, we build a panoptic-aware structured 3D Gaussian model
to introduce smoothness and design effective noise reduction strategies. For
the semantic field, instead of initialization with structure from motion, we
construct reliable semantic anchor points to initialize the 3D Gaussians. We
then use these anchor points as smooth regularization during training.
Additionally, we present a self-training approach using pseudo labels generated
by merging the rendered masks with the noisy masks to enhance the robustness of
PLGS. For the instance field, we project the 2D instance masks into 3D space
and match them with oriented bounding boxes to generate cross-view consistent
instance masks for supervision. Experiments on various benchmarks demonstrate
that our method outperforms previous state-of-the-art methods in terms of both
segmentation quality and speed.
☆ Bilateral Hippocampi Segmentation in Low Field MRIs Using Mutual Feature Learning via Dual-Views
Accurate hippocampus segmentation in brain MRI is critical for studying
cognitive and memory functions and diagnosing neurodevelopmental disorders.
While high-field MRIs provide detailed imaging, low-field MRIs are more
accessible and cost-effective, which eliminates the need for sedation in
children, though they often suffer from lower image quality. In this paper, we
present a novel deep-learning approach for the automatic segmentation of
bilateral hippocampi in low-field MRIs. Extending recent advancements in infant
brain segmentation to underserved communities through the use of low-field MRIs
ensures broader access to essential diagnostic tools, thereby supporting better
healthcare outcomes for all children. Inspired by our previous work, Co-BioNet,
the proposed model employs a dual-view structure to enable mutual feature
learning via high-frequency masking, enhancing segmentation accuracy by
leveraging complementary information from different perspectives. Extensive
experiments demonstrate that our method provides reliable segmentation outcomes
for hippocampal analysis in low-resource settings. The code is publicly
available at: https://github.com/himashi92/LoFiHippSeg.
☆ Enhancing Multimodal Medical Image Classification using Cross-Graph Modal Contrastive Learning
The classification of medical images is a pivotal aspect of disease
diagnosis, often enhanced by deep learning techniques. However, traditional
approaches typically focus on unimodal medical image data, neglecting the
integration of diverse non-image patient data. This paper proposes a novel
Cross-Graph Modal Contrastive Learning (CGMCL) framework for multimodal medical
image classification. The model effectively integrates both image and non-image
data by constructing cross-modality graphs and leveraging contrastive learning
to align multimodal features in a shared latent space. An inter-modality
feature scaling module further optimizes the representation learning process by
reducing the gap between heterogeneous modalities. The proposed approach is
evaluated on two datasets: a Parkinson's disease (PD) dataset and a public
melanoma dataset. Results demonstrate that CGMCL outperforms conventional
unimodal methods in accuracy, interpretability, and early disease prediction.
Additionally, the method shows superior performance in multi-class melanoma
classification. The CGMCL framework provides valuable insights into medical
image classification while offering improved disease interpretability and
predictive capabilities.
☆ Unsupervised Domain Adaptation for Action Recognition via Self-Ensembling and Conditional Embedding Alignment
Recent advancements in deep learning-based wearable human action recognition
(wHAR) have improved the capture and classification of complex motions, but
adoption remains limited due to the lack of expert annotations and domain
discrepancies from user variations. Limited annotations hinder the model's
ability to generalize to out-of-distribution samples. While data augmentation
can improve generalizability, unsupervised augmentation techniques must be
applied carefully to avoid introducing noise. Unsupervised domain adaptation
(UDA) addresses domain discrepancies by aligning conditional distributions with
labeled target samples, but vanilla pseudo-labeling can lead to error
propagation. To address these challenges, we propose $\mu$DAR, a novel joint
optimization architecture comprised of three functions: (i) consistency
regularizer between augmented samples to improve model classification
generalizability, (ii) temporal ensemble for robust pseudo-label generation and
(iii) conditional distribution alignment to improve domain generalizability.
The temporal ensemble works by aggregating predictions from past epochs to
smooth out noisy pseudo-label predictions, which are then used in the
conditional distribution alignment module to minimize kernel-based class-wise
conditional maximum mean discrepancy ($k$CMMD) between the source and target
feature space to learn a domain invariant embedding. The
consistency-regularized augmentations ensure that multiple augmentations of the
same sample share the same labels; this results in (a) strong generalization
with limited source domain samples and (b) consistent pseudo-label generation
in target samples. The novel integration of these three modules in $\mu$DAR
results in a range of $\approx$ 4-12% average macro-F1 score improvement over
six state-of-the-art UDA methods on four benchmark wHAR datasets.
comment: This work has been accepted to the Proceedings of the IEEE
International Conference on Data Mining, 2024
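A minimal version of the temporal-ensemble idea used by $\mu$DAR for
pseudo-label smoothing: keep an exponential moving average of each target
sample's predicted class probabilities across epochs and only trust confident
averages. The smoothing factor and confidence threshold are assumed values.
```python
import torch

class TemporalEnsemble:
    """Exponential moving average of per-sample target predictions across
    epochs, used to smooth noisy pseudo-labels (illustrative sketch)."""
    def __init__(self, num_samples: int, num_classes: int, alpha: float = 0.6):
        self.alpha = alpha
        self.avg = torch.zeros(num_samples, num_classes)

    def update(self, indices: torch.Tensor, probs: torch.Tensor):
        self.avg[indices] = self.alpha * self.avg[indices] + (1 - self.alpha) * probs

    def pseudo_labels(self, threshold: float = 0.9):
        conf, labels = self.avg.max(dim=1)
        return labels, conf >= threshold  # keep only confident pseudo-labels
```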
☆ GenDP: 3D Semantic Fields for Category-Level Generalizable Diffusion Policy
Diffusion-based policies have shown remarkable capability in executing
complex robotic manipulation tasks but lack explicit characterization of
geometry and semantics, which often limits their ability to generalize to
unseen objects and layouts. To enhance the generalization capabilities of
Diffusion Policy, we introduce a novel framework that incorporates explicit
spatial and semantic information via 3D semantic fields. We generate 3D
descriptor fields from multi-view RGBD observations with large foundational
vision models, then compare these descriptor fields against reference
descriptors to obtain semantic fields. The proposed method explicitly considers
geometry and semantics, enabling strong generalization capabilities in tasks
requiring category-level generalization, resolving geometric ambiguities, and
attention to subtle geometric details. We evaluate our method across eight
tasks involving articulated objects and instances with varying shapes and
textures from multiple object categories. Our method demonstrates its
effectiveness by increasing Diffusion Policy's average success rate on unseen
instances from 20% to 93%. Additionally, we provide a detailed analysis and
visualization to interpret the sources of performance gain and explain how our
method can generalize to novel instances.
comment: Accepted to Conference on Robot Learning (CoRL 2024). Project Page:
https://robopil.github.io/GenDP/
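The descriptor-to-semantic-field step described above can be sketched as a
cosine similarity between a per-point descriptor field and a reference
descriptor; the shapes and the similarity choice are assumptions for
illustration.
```python
import torch
import torch.nn.functional as F

def semantic_field(descriptor_field: torch.Tensor, reference: torch.Tensor) -> torch.Tensor:
    """Per-point semantic score: cosine similarity between an (N, D) descriptor
    field and a (D,) reference descriptor (illustrative only)."""
    return F.cosine_similarity(descriptor_field, reference.unsqueeze(0), dim=-1)
```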
☆ Which Client is Reliable?: A Reliable and Personalized Prompt-based Federated Learning for Medical Image Question Answering
Conventional medical artificial intelligence (AI) models face barriers in
clinical application and ethical issues owing to their inability to handle the
privacy-sensitive characteristics of medical data. We present a novel
personalized federated learning (pFL) method for medical visual question
answering (VQA) models, addressing privacy reliability challenges in the
medical domain. Our method introduces learnable prompts into a Transformer
architecture to efficiently train it on diverse medical datasets without
massive computational costs. Then we introduce a reliable client VQA model that
incorporates Dempster-Shafer evidence theory to quantify uncertainty in
predictions, enhancing the model's reliability. Furthermore, we propose a novel
inter-client communication mechanism that uses maximum likelihood estimation to
balance accuracy and uncertainty, fostering efficient integration of insights
across clients.
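The Dempster-Shafer-style uncertainty quantification mentioned above is
commonly realized with evidential deep learning, where non-negative class
evidence parameterizes a Dirichlet distribution and the uncommitted mass serves
as uncertainty. The sketch below shows this generic recipe, not necessarily the
paper's exact prediction head.
```python
import torch

def evidential_uncertainty(logits: torch.Tensor):
    """Subjective-logic uncertainty from non-negative evidence: belief per class
    is evidence / Dirichlet strength, and K / strength is the leftover
    (uncommitted) mass used as the uncertainty score."""
    evidence = torch.relu(logits)               # non-negative evidence per class
    alpha = evidence + 1.0
    strength = alpha.sum(dim=-1, keepdim=True)
    belief = evidence / strength
    uncertainty = logits.shape[-1] / strength
    return belief, uncertainty
```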
♻ ☆ Pruning By Explaining Revisited: Optimizing Attribution Methods to Prune CNNs and Transformers ECCV 2024
Sayed Mohammad Vakilzadeh Hatefi, Maximilian Dreyer, Reduan Achtibat, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin
To solve ever more complex problems, Deep Neural Networks are scaled to
billions of parameters, leading to huge computational costs. An effective
approach to reduce computational requirements and increase efficiency is to
prune unnecessary components of these often over-parameterized networks.
Previous work has shown that attribution methods from the field of eXplainable
AI serve as effective means to extract and prune the least relevant network
components in a few-shot fashion. We extend the current state by proposing to
explicitly optimize hyperparameters of attribution methods for the task of
pruning, and further include transformer-based networks in our analysis. Our
approach yields higher model compression rates of large transformer- and
convolutional architectures (VGG, ResNet, ViT) compared to previous works,
while still attaining high performance on ImageNet classification tasks. Here,
our experiments indicate that transformers have a higher degree of
over-parameterization compared to convolutional neural networks. Code is
available at https://github.com/erfanhatefi/Pruning-by-eXplaining-in-PyTorch.
comment: Accepted as a workshop paper at ECCV 2024, 26 pages (11 pages
manuscript, 3 pages references, 12 pages appendix)
♻ ★ VILA-U: a Unified Foundation Model Integrating Visual Understanding and Generation
Yecheng Wu, Zhuoyang Zhang, Junyu Chen, Haotian Tang, Dacheng Li, Yunhao Fang, Ligeng Zhu, Enze Xie, Hongxu Yin, Li Yi, Song Han, Yao Lu
VILA-U is a Unified foundation model that integrates Video, Image, Language
understanding and generation. Traditional visual language models (VLMs) use
separate modules for understanding and generating visual content, which can
lead to misalignment and increased complexity. In contrast, VILA-U employs a
single autoregressive next-token prediction framework for both tasks,
eliminating the need for additional components like diffusion models. This
approach not only simplifies the model but also achieves near state-of-the-art
performance in visual language understanding and generation. The success of
VILA-U is attributed to two main factors: the unified vision tower that aligns
discrete visual tokens with textual inputs during pretraining, which enhances
visual perception, and the fact that autoregressive image generation can
achieve quality similar to diffusion models when trained on a high-quality
dataset. This allows VILA-U to
perform comparably to more complex models using a fully token-based
autoregressive framework.
comment: Code: https://github.com/mit-han-lab/vila-u. The first two authors
contributed equally to this work
♻ ☆ JointMotion: Joint Self-Supervision for Joint Motion Prediction
We present JointMotion, a self-supervised pre-training method for joint
motion prediction in self-driving vehicles. Our method jointly optimizes a
scene-level objective connecting motion and environments, and an instance-level
objective to refine learned representations. Scene-level representations are
learned via non-contrastive similarity learning of past motion sequences and
environment context. At the instance level, we use masked autoencoding to
refine multimodal polyline representations. We complement this with an adaptive
pre-training decoder that enables JointMotion to generalize across different
environment representations, fusion mechanisms, and dataset characteristics.
Notably, our method reduces the joint final displacement error of Wayformer,
HPTR, and Scene Transformer models by 3%, 8%, and 12%, respectively; and
enables transfer learning between the Waymo Open Motion and the Argoverse 2
Motion Forecasting datasets. Code: https://github.com/kit-mrt/future-motion
comment: CoRL'24 camera-ready
♻ ☆ Telling Stories for Common Sense Zero-Shot Action Recognition ACCV 2024
Video understanding has long suffered from reliance on large labeled
datasets, motivating research into zero-shot learning. Recent progress in
language modeling presents opportunities to advance zero-shot video analysis,
but constructing an effective semantic space relating action classes remains
challenging. We address this by introducing a novel dataset, Stories, which
contains rich textual descriptions for diverse action classes extracted from
WikiHow articles. For each class, we extract multi-sentence narratives
detailing the necessary steps, scenes, objects, and verbs that characterize the
action. This contextual data enables modeling of nuanced relationships between
actions, paving the way for zero-shot transfer. We also propose an approach
that harnesses Stories to improve feature generation for training zero-shot
classification. Without any target dataset fine-tuning, our method achieves new
state-of-the-art on multiple benchmarks, improving top-1 accuracy by up to
6.1%. We believe Stories provides a valuable resource that can catalyze
progress in zero-shot action recognition. The textual narratives forge
connections between seen and unseen classes, overcoming the bottleneck of
labeled data that has long impeded advancements in this exciting domain. The
data can be found here: https://github.com/kini5gowda/Stories .
comment: Accepted in ACCV 2024!
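One simple way to turn the Stories narratives into a semantic space is to
mean-pool sentence embeddings of each class's multi-sentence description, as
sketched below; the sentence encoder is an assumed off-the-shelf choice, not
the paper's model.
```python
import numpy as np
from sentence_transformers import SentenceTransformer

def class_embedding_from_story(story_sentences, encoder_name="all-MiniLM-L6-v2"):
    """Mean-pool normalized sentence embeddings of a class narrative (steps,
    scenes, objects, verbs) into one class embedding (illustrative recipe)."""
    encoder = SentenceTransformer(encoder_name)
    embeddings = encoder.encode(story_sentences, normalize_embeddings=True)
    mean = embeddings.mean(axis=0)
    return mean / (np.linalg.norm(mean) + 1e-8)
```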
♻ ☆ Generalizable Prompt Tuning for Vision-Language Models
Prompt tuning for vision-language models such as CLIP involves optimizing the
text prompts used to generate image-text pairs for specific downstream tasks.
While hand-crafted or template-based prompts are generally applicable to a
wider range of unseen classes, they tend to perform poorly in downstream tasks
(i.e., seen classes). Learnable soft prompts, on the other hand, often perform
well in downstream tasks but lack generalizability. Additionally, prior
research has predominantly concentrated on the textual modality, with very few
studies attempting to explore the prompt's generalization potential from the
visual modality. Keeping these limitations in mind, we investigate how to
perform prompt tuning to obtain both competitive downstream performance and
generalization. The study shows that by treating soft and hand-crafted prompts
as dual views of the textual modality, and maximizing their mutual information,
we can better ensemble task-specific and general semantic information.
Moreover, to generate more expressive prompts, the study introduces a
class-wise augmentation from the visual modality, resulting in significant
robustness to a wider range of unseen classes. Extensive evaluations on several
benchmarks report that the proposed approach achieves competitive results in
terms of both task-specific performance and general abilities.
comment: in progress
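The dual-view idea above can be sketched as follows: encode the soft prompt
and a hand-crafted template prompt into text features, maximize their
agreement during tuning, and use their normalized ensemble at inference. The
cosine agreement term below is a simple stand-in for the paper's
mutual-information objective.
```python
import torch
import torch.nn.functional as F

def ensemble_dual_views(soft_feats: torch.Tensor, hand_feats: torch.Tensor):
    """Treat soft-prompt and hand-crafted-prompt text features as two views:
    return their normalized ensemble and an agreement score to maximize."""
    soft = F.normalize(soft_feats, dim=-1)
    hand = F.normalize(hand_feats, dim=-1)
    agreement = (soft * hand).sum(-1).mean()    # keep the two views consistent
    ensemble = F.normalize(soft + hand, dim=-1) # task-specific + general semantics
    return ensemble, agreement
```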
♻ ☆ Exploring the Adversarial Robustness of CLIP for AI-generated Image Detection
In recent years, many forensic detectors have been proposed to detect
AI-generated images and prevent their use for malicious purposes. Convolutional
neural networks (CNNs) have long been the dominant architecture in this field
and have been the subject of intense study. However, recently proposed
Transformer-based detectors have been shown to match or even outperform
CNN-based detectors, especially in terms of generalization. In this paper, we
study the adversarial robustness of AI-generated image detectors, focusing on
Contrastive Language-Image Pretraining (CLIP)-based methods that rely on Visual
Transformer (ViT) backbones and comparing their performance with CNN-based
methods. We study the robustness to different adversarial attacks under a
variety of conditions and analyze both numerical results and frequency-domain
patterns. CLIP-based detectors are found to be vulnerable to white-box attacks
just like CNN-based detectors. However, attacks do not easily transfer between
CNN-based and CLIP-based methods. This is also confirmed by the different
distribution of the adversarial noise patterns in the frequency domain.
Overall, this analysis provides new insights into the properties of forensic
detectors that can help to develop more effective strategies.
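A concrete example of the white-box setting studied above is a single-step
FGSM attack on a detector, shown below; the epsilon budget is an assumed
value, and the paper evaluates a broader set of attacks on both CLIP/ViT-based
and CNN-based detectors.
```python
import torch

def fgsm_attack(detector, images, labels, eps=4 / 255):
    """Single-step FGSM: perturb inputs along the sign of the loss gradient to
    flip the detector's decision (generic white-box attack for illustration)."""
    images = images.clone().detach().requires_grad_(True)
    loss = torch.nn.functional.cross_entropy(detector(images), labels)
    loss.backward()
    adv = images + eps * images.grad.sign()  # push toward the wrong class
    return adv.clamp(0, 1).detach()
```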
♻ ☆ Leveraging Hallucinations to Reduce Manual Prompt Dependency in Promptable Segmentation NeurIPS 2024
Promptable segmentation typically requires instance-specific manual prompts
to guide the segmentation of each desired object. To minimize such a need,
task-generic promptable segmentation has been introduced, which employs a
single task-generic prompt to segment various images of different objects in
the same task. Current methods use Multimodal Large Language Models (MLLMs) to
reason detailed instance-specific prompts from a task-generic prompt for
improving segmentation accuracy. The effectiveness of this segmentation heavily
depends on the precision of these derived prompts. However, MLLMs often suffer
hallucinations during reasoning, resulting in inaccurate prompting. While
existing methods focus on eliminating hallucinations to improve a model, we
argue that MLLM hallucinations can reveal valuable contextual insights when
leveraged correctly, as they represent pre-trained large-scale knowledge beyond
individual images. In this paper, we utilize hallucinations to mine
task-related information from images and verify its accuracy for enhancing
precision of the generated prompts. Specifically, we introduce an iterative
Prompt-Mask Cycle generation framework (ProMaC) with a prompt generator and a
mask generator. The prompt generator uses a multi-scale chain of thought
prompting, initially exploring hallucinations for extracting extended
contextual knowledge on a test image. These hallucinations are then reduced to
formulate precise instance-specific prompts, directing the mask generator to
produce masks that are consistent with task semantics by mask semantic
alignment. The generated masks iteratively induce the prompt generator to focus
more on task-relevant image areas and reduce irrelevant hallucinations,
resulting jointly in better prompts and masks. Experiments on 5 benchmarks
demonstrate the effectiveness of ProMaC. Code given in
https://lwpyh.github.io/ProMaC/.
comment: NeurIPS 2024
♻ ☆ LocoMotion: Learning Motion-Focused Video-Language Representations ACCV 2024
This paper strives for motion-focused video-language representations.
Existing methods to learn video-language representations use spatial-focused
data, where identifying the objects and scene is often enough to distinguish
the relevant caption. We instead propose LocoMotion to learn from
motion-focused captions that describe the movement and temporal progression of
local object motions. We achieve this by adding synthetic motions to videos and
using the parameters of these motions to generate corresponding captions.
Furthermore, we propose verb-variation paraphrasing to increase the caption
variety and learn the link between primitive motions and high-level verbs. With
this, we are able to learn a motion-focused video-language representation.
Experiments demonstrate our approach is effective for a variety of downstream
tasks, particularly when limited data is available for fine-tuning. Code is
available: https://hazeldoughty.github.io/Papers/LocoMotion/
comment: ACCV 2024 Oral
♻ ☆ Accessible, At-Home Detection of Parkinson's Disease via Multi-task Video Analysis
Md Saiful Islam, Tariq Adnan, Jan Freyberg, Sangwu Lee, Abdelrahman Abdelkader, Meghan Pawlik, Cathe Schwartz, Karen Jaffe, Ruth B. Schneider, E Ray Dorsey, Ehsan Hoque
Limited accessibility to neurological care leads to underdiagnosed
Parkinson's Disease (PD), preventing early intervention. Existing AI-based PD
detection methods primarily focus on unimodal analysis of motor or speech
tasks, overlooking the multifaceted nature of the disease. To address this, we
introduce a large-scale, multi-task video dataset consisting of 1102 sessions
(each containing videos of finger tapping, facial expression, and speech tasks
captured via webcam) from 845 participants (272 with PD). We propose a novel
Uncertainty-calibrated Fusion Network (UFNet) that leverages this multimodal
data to enhance diagnostic accuracy. UFNet employs independent task-specific
networks, trained with Monte Carlo Dropout for uncertainty quantification,
followed by self-attended fusion of features, with attention weights
dynamically adjusted based on task-specific uncertainties. To ensure
patient-centered evaluation, the participants were randomly split into three
sets: 60% for training, 20% for model selection, and 20% for final performance
evaluation. UFNet significantly outperformed single-task models in terms of
accuracy, area under the ROC curve (AUROC), and sensitivity while maintaining
non-inferior specificity. Withholding uncertain predictions further boosted the
performance, achieving 88.0±0.3% accuracy, 93.0±0.2% AUROC, 79.3±0.9%
sensitivity, and 92.6±0.3% specificity, at the expense of not being able to
predict for 2.3±0.3% of the data (± denotes the 95% confidence interval). Further
analysis suggests that the trained model does not exhibit any detectable bias
across sex and ethnic subgroups and is most effective for individuals aged
between 50 and 80. Requiring only a webcam and microphone, our approach
facilitates accessible home-based PD screening, especially in regions with
limited healthcare resources.
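Two ingredients of UFNet can be illustrated generically: Monte Carlo Dropout
for per-task uncertainty and an uncertainty-aware weighting when fusing task
predictions. The sketch below is a simplified recipe, not UFNet's exact
self-attended fusion.
```python
import torch

def mc_dropout_predict(model, x, passes: int = 20):
    """Keep dropout active at test time (model.train() also affects batch norm,
    which is ignored in this sketch) and average stochastic forward passes;
    the predictive variance serves as a per-task uncertainty estimate."""
    model.train()
    with torch.no_grad():
        probs = torch.stack([torch.softmax(model(x), dim=-1) for _ in range(passes)])
    return probs.mean(0), probs.var(0).sum(-1)

def uncertainty_weighted_fusion(task_probs, task_uncertainties):
    """Fuse per-task predictions with weights that shrink as uncertainty grows."""
    weights = torch.softmax(-torch.stack(task_uncertainties), dim=0)
    return (weights.unsqueeze(-1) * torch.stack(task_probs)).sum(0)
```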
♻ ☆ SCA: Highly Efficient Semantic-Consistent Unrestricted Adversarial Attack
Deep neural network based systems deployed in sensitive environments are
vulnerable to adversarial attacks. Unrestricted adversarial attacks typically
manipulate the semantic content of an image (e.g., color or texture) to create
adversarial examples that are both effective and photorealistic. Recent works
have utilized the diffusion inversion process to map images into a latent
space, where high-level semantics are manipulated by introducing perturbations.
However, they often result in substantial semantic distortions in the denoised output and suffer from low efficiency. In this study, we propose a novel
framework called Semantic-Consistent Unrestricted Adversarial Attacks (SCA),
which employs an inversion method to extract edit-friendly noise maps and
utilizes Multimodal Large Language Model (MLLM) to provide semantic guidance
throughout the process. Under the condition of rich semantic information
provided by MLLM, we perform the DDPM denoising process of each step using a
series of edit-friendly noise maps, and leverage DPM Solver++ to accelerate
this process, enabling efficient sampling with semantic consistency. Compared
to existing methods, our framework enables the efficient generation of
adversarial examples that exhibit minimal discernible semantic changes.
Consequently, we for the first time introduce Semantic-Consistent Adversarial
Examples (SCAE). Extensive experiments and visualizations have demonstrated the
high efficiency of SCA, particularly in being on average 12 times faster than
the state-of-the-art attacks. Our research can further draw attention to the
security of multimedia information.
♻ ☆ PnLCalib: Sports Field Registration via Points and Lines Optimization
Camera calibration in broadcast sports videos presents numerous challenges
for accurate sports field registration due to multiple camera angles, varying
camera parameters, and frequent occlusions of the field. Traditional
search-based methods depend on initial camera pose estimates, which can
struggle in non-standard positions and dynamic environments. In response, we
propose an optimization-based calibration pipeline that leverages a 3D soccer
field model and a predefined set of keypoints to overcome these limitations.
Our method also introduces a novel refinement module that improves initial
calibration by using detected field lines in a non-linear optimization process.
This approach outperforms existing techniques in both multi-view and
single-view 3D camera calibration tasks, while maintaining competitive
performance in homography estimation. Extensive experimentation on real-world
soccer datasets, including SoccerNet-Calibration, WorldCup 2014, and
TS-WorldCup, highlights the robustness and accuracy of our method across
diverse broadcast scenarios. Our approach offers significant improvements in
camera calibration precision and reliability.
comment: Extended version of "No Bells, Just Whistles: Sports Field
Registration Leveraging Geometric Properties"
♻ ☆ PixLore: A Dataset-driven Approach to Rich Image Captioning
Diego Bonilla-Salvador, Marcelino Martínez-Sober, Joan Vila-Francés, Antonio José Serrano-López, Pablo Rodríguez-Belenguer, Fernando Mateo
In the domain of vision-language integration, generating detailed image
captions poses a significant challenge due to the lack of curated and rich
datasets. This study introduces PixLore, a novel method that leverages Querying
Transformers through the fine-tuning of the BLIP-2 model using the LoRA method on a standard commercial GPU. The approach, which involves training on
a carefully assembled dataset from state-of-the-art Computer Vision models
combined and augmented by ChatGPT, addresses the question of whether intricate
image understanding can be achieved with an ensemble of smaller-scale models,
referred to as Knowledge Stitching. Comparative evaluations against major
models such as GPT-4 and Google Bard demonstrate that PixLore-2.7B, despite
having considerably fewer parameters, is rated higher than the existing
state-of-the-art models in over half of the assessments. Specifically, PixLore outperforms Bard and BLIP-2, which score approximately 35.18% and 27.98% lower, respectively, in the task of image captioning. This research not only presents a
groundbreaking approach but also highlights the importance of well-curated
datasets in enhancing the performance of smaller models.
comment: Preprint, pending publication
♻ ☆ Denoising Diffusion Models for Inpainting of Healthy Brain Tissue MICCAI
This paper is a contribution to the "BraTS 2023 Local Synthesis of Healthy
Brain Tissue via Inpainting Challenge". The task of this challenge is to
transform tumor tissue into healthy tissue in brain magnetic resonance (MR)
images. This idea originates from the problem that MR images can be evaluated
using automatic processing tools, however, many of these tools are optimized
for the analysis of healthy tissue. By solving the given inpainting task, we
enable the automatic analysis of images featuring lesions, and further
downstream tasks. Our approach builds on denoising diffusion probabilistic
models. We use a 2D model trained on slices from which healthy tissue was cropped out and which learns to inpaint it again. This allows us to use the
ground truth healthy tissue during training. In the sampling stage, we replace
the slices containing diseased tissue in the original 3D volume with the slices
containing the healthy tissue inpainting. With our approach, we achieve
comparable results to the competing methods. On the validation set our model
achieves a mean SSIM of 0.7804, a PSNR of 20.3525, and an MSE of 0.0113. In the future, we plan to extend our 2D model to a 3D model, allowing the region of interest to be inpainted as a whole without losing context information from neighboring slices.
comment: 12 pages, 5 figures, MICCAI challenge submission
♻ ☆ A Multimodal Fusion Network For Student Emotion Recognition Based on Transformer and Tensor Product
This paper introduces a new multi-modal model based on the Transformer
architecture and tensor product fusion strategy, combining BERT's text vectors
and ViT's image vectors to classify students' psychological conditions, with an
accuracy of 93.65%. The purpose of the study is to accurately analyze the
mental health status of students from various data sources. This paper
discusses modal fusion methods, including early, late and intermediate fusion,
to overcome the challenges of integrating multi-modal information. Ablation
studies compare the performance of different models and fusion techniques,
showing that the proposed model outperforms existing methods such as CLIP and
ViLBERT in terms of accuracy and inference speed. Conclusions indicate that
while this model has significant advantages in emotion recognition, its
potential to incorporate other data modalities provides areas for future
research.
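A hedged sketch of tensor-product fusion of a text vector (e.g., from BERT) and an image vector (e.g., from ViT); the low-rank projections and dimensions are assumptions added to keep the outer product small, not the paper's reported architecture:

```python
# Illustrative sketch: fuse text and image vectors via their outer (tensor) product.
import torch
import torch.nn as nn

class TensorProductFusion(nn.Module):
    def __init__(self, text_dim=768, img_dim=768, rank=32, n_classes=4):
        super().__init__()
        self.t = nn.Linear(text_dim, rank)     # low-rank projections (assumption)
        self.v = nn.Linear(img_dim, rank)
        self.head = nn.Linear(rank * rank, n_classes)

    def forward(self, text_vec, img_vec):
        # Outer product captures all pairwise interactions between text and image features.
        outer = torch.einsum('bi,bj->bij', self.t(text_vec), self.v(img_vec))
        return self.head(outer.flatten(1))

model = TensorProductFusion()
logits = model(torch.randn(2, 768), torch.randn(2, 768))   # (2, n_classes)
```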
♻ ☆ Towards Croppable Implicit Neural Representations NeurIPS 2024
Implicit Neural Representations (INRs) have piqued interest in recent years
due to their ability to encode natural signals using neural networks. While
INRs allow for useful applications such as interpolating new coordinates and
signal compression, their black-box nature makes it difficult to modify them
post-training. In this paper we explore the idea of editable INRs, and
specifically focus on the widely used cropping operation. To this end, we
present Local-Global SIRENs -- a novel INR architecture that supports cropping
by design. Local-Global SIRENs are based on combining local and global feature
extraction for signal encoding. What makes their design unique is the ability
to effortlessly remove specific portions of an encoded signal, with a
proportional weight decrease. This is achieved by eliminating the corresponding
weights from the network, without the need for retraining. We further show how
this architecture can be used to support the straightforward extension of
previously encoded signals. Beyond signal editing, we examine how the
Local-Global approach can accelerate training, enhance encoding of various
signals, improve downstream performance, and be applied to modern INRs such as
INCODE, highlighting its potential and flexibility. Code is available at
https://github.com/maorash/Local-Global-INRs.
comment: Accepted to NeurIPS 2024
♻ ☆ Breaking Class Barriers: Efficient Dataset Distillation via Inter-Class Feature Compensator
Dataset distillation has emerged as a technique aiming to condense
informative features from large, natural datasets into a compact and synthetic
form. While recent advancements have refined this technique, its performance is
bottlenecked by the prevailing class-specific synthesis paradigm. Under this
paradigm, synthetic data is optimized exclusively for a pre-assigned one-hot
label, creating an implicit class barrier in feature condensation. This leads
to inefficient utilization of the distillation budget and oversight of
inter-class feature distributions, which ultimately limits the effectiveness
and efficiency, as demonstrated in our analysis. To overcome these constraints,
this paper presents the Inter-class Feature Compensator (INFER), an innovative
distillation approach that transcends the class-specific data-label framework
widely utilized in current dataset distillation methods. Specifically, INFER
leverages a Universal Feature Compensator (UFC) to enhance feature integration
across classes, enabling the generation of multiple additional synthetic
instances from a single UFC input. This significantly improves the efficiency
of the distillation budget. Moreover, INFER enriches inter-class interactions
during the distillation, thereby enhancing the effectiveness and
generalizability of the distilled data. By allowing for the linear
interpolation of labels similar to those in the original dataset, INFER
meticulously optimizes the synthetic data and dramatically reduces the size of
soft labels in the synthetic dataset to almost zero, establishing a new
benchmark for efficiency and effectiveness in dataset distillation.
♻ ☆ Exploring Stronger Transformer Representation Learning for Occluded Person Re-Identification
Due to some complex factors (e.g., occlusion, pose variation and diverse
camera perspectives), extracting stronger feature representation in person
re-identification remains a challenging task. In this paper, we propose SSSC-TransReID, a novel transformer-based person re-identification framework that combines self-supervision and supervision. Different from general transformer-based person re-identification models, we design a
self-supervised contrastive learning branch, which can enhance the feature
representation for person re-identification without negative samples or
additional pre-training. To train the contrastive learning branch, we also propose a novel random rectangle mask strategy that simulates occlusion in real scenes, so as to enhance the feature representation under occlusion. Finally, we utilize a joint-training loss function to integrate the
advantages of supervised learning with ID tags and self-supervised contrastive
learning without negative samples, which can reinforce the ability of our model
to excavate stronger discriminative features, especially for occlusion.
Extensive experimental results on several benchmark datasets show our proposed
model obtains superior Re-ID performance consistently and outperforms the
state-of-the-art ReID methods by large margins in mean average precision (mAP) and Rank-1 accuracy.
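A small illustrative sketch of a random rectangle mask that simulates occlusion on input images, in the spirit of the strategy described above; the size range and fill value are assumptions:

```python
# Hedged sketch: zero out one random rectangle per image to simulate occlusion.
import torch

def random_rectangle_mask(imgs, min_frac=0.1, max_frac=0.4, fill=0.0):
    """imgs: (B, C, H, W) -> copy with one random rectangle per image set to `fill`."""
    B, _, H, W = imgs.shape
    out = imgs.clone()
    for b in range(B):
        h = int(H * torch.empty(1).uniform_(min_frac, max_frac))
        w = int(W * torch.empty(1).uniform_(min_frac, max_frac))
        top = torch.randint(0, H - h + 1, (1,)).item()
        left = torch.randint(0, W - w + 1, (1,)).item()
        out[b, :, top:top + h, left:left + w] = fill
    return out

occluded = random_rectangle_mask(torch.rand(8, 3, 256, 128))  # ReID-style input
```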
♻ ☆ From Real Artifacts to Virtual Reference: A Robust Framework for Translating Endoscopic Images
Domain adaptation, which bridges the distributions across different
modalities, plays a crucial role in multimodal medical image analysis. In
endoscopic imaging, combining pre-operative data with intra-operative imaging
is important for surgical planning and navigation. However, existing domain
adaptation methods are hampered by distribution shift caused by in vivo
artifacts, necessitating robust techniques for aligning noisy and artifact
abundant patient endoscopic videos with clean virtual images reconstructed from
pre-operative tomographic data for pose estimation during intraoperative
guidance. This paper presents an artifact-resilient image translation method
and an associated benchmark for this purpose. The method incorporates a novel
``local-global'' translation framework and a noise-resilient feature extraction
strategy. For the former, it decouples the image translation process into a
local step for feature denoising, and a global step for global style transfer.
For feature extraction, a new contrastive learning strategy is proposed, which
can extract noise-resilient features for establishing robust correspondence
across domains. Detailed validation on both public and in-house clinical
datasets has been conducted, demonstrating significantly improved performance
compared to the current state-of-the-art.
♻ ☆ Few-Shot Adversarial Prompt Learning on Vision-Language Models NeurIPS 2024
The vulnerability of deep neural networks to imperceptible adversarial
perturbations has attracted widespread attention. Inspired by the success of
vision-language foundation models, previous efforts achieved zero-shot
adversarial robustness by aligning adversarial visual features with text
supervision. However, in practice, they are still unsatisfactory due to several
issues, including heavy adaptation cost, suboptimal text supervision, and
uncontrolled natural generalization capacity. In this paper, to address these
issues, we propose a few-shot adversarial prompt framework in which adapting input sequences with limited data yields significant adversarial robustness improvements. Specifically, we achieve this by providing adversarially
correlated text supervision that is end-to-end learned from adversarial
examples. We also propose a novel training objective that enhances the
consistency of multi-modal features while encouraging differentiated uni-modal features between natural and adversarial examples. The proposed framework enables learning adversarial text supervision, which provides superior
cross-modal adversarial alignment and matches state-of-the-art zero-shot
adversarial robustness with only 1% training data. Code is available at:
https://github.com/lionel-w2/FAP.
comment: NeurIPS 2024
♻ ☆ Enhancing Interaction Modeling with Agent Selection and Physical Coefficient for Trajectory Prediction SP
A thorough understanding of the interaction between the target agent and
surrounding agents is a prerequisite for accurate trajectory prediction.
Although many methods have been explored, they all assign correlation
coefficients to surrounding agents in a purely learning-based manner. In this
study, we present ASPILin, which manually selects interacting agents and
calculates their correlations instead of attention scores. Surprisingly, these
simple modifications can significantly improve prediction performance and
substantially reduce computational costs. Additionally, ASPILin models the
interacting agents at each past time step separately, rather than only modeling
the interacting agents at the current time step. This clarifies the causal
chain of the target agent's historical trajectory and helps the model better
understand dynamic interactions. We intentionally simplified our model in other
aspects, such as map encoding. Remarkably, experiments conducted on the
INTERACTION, highD, and CitySim datasets demonstrate that our method is
efficient and straightforward, outperforming other state-of-the-art methods.
comment: code:https://github.com/kkk00714/ASPILin
♻ ☆ Conquering the Communication Constraints to Enable Large Pre-Trained Models in Federated Learning
Federated learning (FL) has emerged as a promising paradigm for enabling the
collaborative training of models without centralized access to the raw data on
local devices. In the typical FL paradigm (e.g., FedAvg), model weights are
sent to and from the server each round to participating clients. Recently, the
use of small pre-trained models has been shown effective in federated learning
optimization and improving convergence. However, recent state-of-the-art
pre-trained models are getting more capable but also have more parameters. In
conventional FL, sharing the enormous model weights can quickly put a massive
communication burden on the system, especially if more capable models are
employed. Can we find a solution to enable those strong and readily-available
pre-trained models in FL to achieve excellent performance while simultaneously
reducing the communication burden? To this end, we investigate the use of
parameter-efficient fine-tuning in federated learning and thus introduce a new
framework: FedPEFT. Specifically, we systematically evaluate the performance of
FedPEFT across a variety of client stability, data distribution, and
differential privacy settings. By only locally tuning and globally sharing a
small portion of the model weights, significant reductions in the total
communication overhead can be achieved while maintaining competitive or even
better performance in a wide range of federated learning scenarios, providing
insight into a new paradigm for practical and effective federated systems.
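A toy sketch of the parameter-efficient federated idea, assuming a frozen backbone with only a small trainable head being communicated and averaged; the module names, plain FedAvg averaging, and the client loop are illustrative assumptions:

```python
# Hedged sketch: clients train and share only a small trainable head on a frozen backbone.
import copy
import torch
import torch.nn as nn

def trainable_state(model):
    return {k: v.detach().clone() for k, v in model.named_parameters() if v.requires_grad}

def fedavg(states):
    return {k: torch.stack([s[k] for s in states]).mean(0) for k in states[0]}

backbone = nn.Sequential(nn.Linear(512, 512), nn.ReLU())    # stand-in for a pre-trained model
for p in backbone.parameters():
    p.requires_grad = False                                  # frozen; never communicated

global_model = nn.Sequential(backbone, nn.Linear(512, 10))  # only the head is tuned and shared
client_states = []
for _ in range(3):                                           # three simulated clients
    client = copy.deepcopy(global_model)
    opt = torch.optim.SGD([p for p in client.parameters() if p.requires_grad], lr=0.1)
    x, y = torch.randn(16, 512), torch.randint(0, 10, (16,))
    loss = nn.functional.cross_entropy(client(x), y)
    loss.backward(); opt.step()
    client_states.append(trainable_state(client))            # tiny payload vs. full weights

global_model.load_state_dict(fedavg(client_states), strict=False)  # server-side averaging
```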
♻ ☆ Latent Noise Segmentation: How Neural Noise Leads to the Emergence of Segmentation and Grouping ICML 2024
Humans are able to segment images effortlessly without supervision using
perceptual grouping. Here, we propose a counter-intuitive computational
approach to solving unsupervised perceptual grouping and segmentation: that
they arise because of neural noise, rather than in spite of it. We (1)
mathematically demonstrate that under realistic assumptions, neural noise can
be used to separate objects from each other; (2) that adding noise in a DNN
enables the network to segment images even though it was never trained on any
segmentation labels; and (3) that segmenting objects using noise results in
segmentation performance that aligns with the perceptual grouping phenomena
observed in humans, and is sample-efficient. We introduce the Good Gestalt (GG)
datasets -- six datasets designed to specifically test perceptual grouping, and
show that our DNN models reproduce many important phenomena in human
perception, such as illusory contours, closure, continuity, proximity, and
occlusion. Finally, we (4) show that our model improves performance on our GG
datasets compared to other tested unsupervised models by $24.9\%$. Together,
our results suggest a novel unsupervised segmentation method requiring few
assumptions, a new explanation for the formation of perceptual grouping, and a
novel potential benefit of neural noise.
comment: ICML 2024 camera ready version
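A speculative toy sketch of the noise-driven grouping idea: run one image through a network many times with independent noise, then cluster pixels by how their outputs fluctuate across noise draws. The network, noise scale, and clustering step are illustrative assumptions, not the paper's models:

```python
# Hedged sketch: cluster pixels by their output fluctuations across independent noise draws.
import torch
import torch.nn as nn
from sklearn.cluster import KMeans

# Untrained toy network: clusters here are arbitrary; only the mechanism is illustrated.
net = nn.Sequential(nn.Conv2d(1, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 1, 3, padding=1))

def noise_segmentation(img, n_draws=64, noise_std=0.1, n_segments=2):
    outs = []
    with torch.no_grad():
        for _ in range(n_draws):
            noisy = img + noise_std * torch.randn_like(img)   # independent noise injection
            outs.append(net(noisy))
    profiles = torch.stack(outs).squeeze(2).squeeze(1)        # (n_draws, H, W)
    H, W = profiles.shape[1:]
    pixel_vectors = profiles.reshape(n_draws, -1).T.numpy()   # one fluctuation vector per pixel
    labels = KMeans(n_clusters=n_segments, n_init=10).fit_predict(pixel_vectors)
    return labels.reshape(H, W)

seg = noise_segmentation(torch.rand(1, 1, 32, 32))
```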
♻ ☆ STBA: Towards Evaluating the Robustness of DNNs for Query-Limited Black-box Scenario
Many attack techniques have been proposed to explore the vulnerability of
DNNs and further help to improve their robustness. Despite the significant
progress made recently, existing black-box attack methods still suffer from
unsatisfactory performance due to the vast number of queries needed to optimize
desired perturbations. Another critical challenge is that adversarial examples built by adding noise appear abnormal and struggle to attack robust models whose robustness has been enhanced by adversarial training against small perturbations. Together, these two issues increase the risk of the attack being exposed and prevent a thorough probing of the vulnerability of DNNs. Hence, it is necessary to evaluate DNNs' fragility under query-limited settings in a non-additive manner. In this paper, we propose the Spatial Transform
Black-box Attack (STBA), a novel framework to craft formidable adversarial
examples in the query-limited scenario. Specifically, STBA generates adversarial examples by applying an estimated flow field to the high-frequency part of clean images, and adopts two processes to enhance their naturalness and significantly improve query efficiency: a) the flow field perturbs the high-frequency component instead of introducing external noise to the benign image, and b) an efficient gradient estimation method based on a batch of samples optimizes this flow field under query-limited settings. Compared to
existing score-based black-box baselines, extensive experiments indicated that
STBA could effectively improve the imperceptibility of the adversarial examples
and remarkably boost the attack success rate under query-limited settings.
comment: Accepted by T-MM
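An illustrative sketch of warping only the high-frequency component of an image with a small flow field, as in the idea above; the Gaussian-blur frequency split and random flow are assumptions, and the paper's query-efficient gradient estimation of the flow is omitted:

```python
# Hedged sketch: split low/high frequencies with a Gaussian blur and warp only the high part.
import torch
import torch.nn.functional as F

def gaussian_blur(img, k=11, sigma=3.0):
    coords = torch.arange(k) - k // 2
    g = torch.exp(-coords**2 / (2 * sigma**2))
    g = (g / g.sum()).to(img)
    kernel = (g[:, None] * g[None, :]).expand(img.shape[1], 1, k, k).contiguous()
    return F.conv2d(img, kernel, padding=k // 2, groups=img.shape[1])

def warp_high_freq(img, flow):
    """img: (B, C, H, W); flow: (B, H, W, 2) small offsets in normalized coordinates."""
    B, _, H, W = img.shape
    low = gaussian_blur(img)
    high = img - low
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, H), torch.linspace(-1, 1, W), indexing='ij')
    grid = torch.stack([xs, ys], dim=-1).expand(B, H, W, 2) + flow
    warped_high = F.grid_sample(high, grid, align_corners=True)
    return (low + warped_high).clamp(0, 1)

x = torch.rand(1, 3, 224, 224)
adv = warp_high_freq(x, 0.01 * torch.randn(1, 224, 224, 2))   # random flow as a stand-in
```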
♻ ☆ ODTFormer: Efficient Obstacle Detection and Tracking with Stereo Cameras Based on Transformer IROS 2024
Obstacle detection and tracking represent a critical component in robot
autonomous navigation. In this paper, we propose ODTFormer, a Transformer-based
model to address both obstacle detection and tracking problems. For the
detection task, our approach leverages deformable attention to construct a 3D
cost volume, which is decoded progressively in the form of voxel occupancy
grids. We further track the obstacles by matching the voxels between
consecutive frames. The entire model can be optimized in an end-to-end manner.
Through extensive experiments on DrivingStereo and KITTI benchmarks, our model
achieves state-of-the-art performance in the obstacle detection task. We also
report comparable accuracy to state-of-the-art obstacle tracking models while
requiring only a fraction of their computation cost, typically ten-fold to
twenty-fold less. The code and model weights will be publicly released.
comment: 8 pages. Accepted by IROS 2024
♻ ☆ Meteor: Mamba-based Traversal of Rationale for Large Language and Vision Models
The rapid development of large language and vision models (LLVMs) has been
driven by advances in visual instruction tuning. Recently, open-source LLVMs
have curated high-quality visual instruction tuning datasets and utilized
additional vision encoders or multiple computer vision models in order to
narrow the performance gap with powerful closed-source LLVMs. These
advancements are attributed to multifaceted information required for diverse
capabilities, including fundamental image understanding, real-world knowledge
about common-sense and non-object concepts (e.g., charts, diagrams, symbols,
signs, and math problems), and step-by-step procedures for solving complex
questions. Drawing from the multifaceted information, we present a new
efficient LLVM, Mamba-based traversal of rationales (Meteor), which leverages
multifaceted rationale to enhance understanding and answering capabilities. To
embed lengthy rationales containing abundant information, we employ the Mamba
architecture, capable of processing sequential data with linear time
complexity. We introduce a new concept of traversal of rationale that
facilitates efficient embedding of rationale. Subsequently, the backbone
multimodal language model (MLM) is trained to generate answers with the aid of
rationale. Through these steps, Meteor achieves significant improvements in
vision language performances across multiple evaluation benchmarks requiring
diverse capabilities, without scaling up the model size or employing additional
vision encoders and computer vision models.
comment: Code is available in https://github.com/ByungKwanLee/Meteor
♻ ☆ LLaVA-MoD: Making LLaVA Tiny via MoE Knowledge Distillation
Fangxun Shu, Yue Liao, Le Zhuo, Chenning Xu, Lei Zhang, Guanghao Zhang, Haonan Shi, Long Chen, Tao Zhong, Wanggui He, Siming Fu, Haoyuan Li, Bolin Li, Zhelun Yu, Si Liu, Hongsheng Li, Hao Jiang
We introduce LLaVA-MoD, a novel framework designed to enable the efficient
training of small-scale Multimodal Language Models (s-MLLM) by distilling
knowledge from large-scale MLLM (l-MLLM). Our approach tackles two fundamental
challenges in MLLM distillation. First, we optimize the network structure of
s-MLLM by integrating a sparse Mixture of Experts (MoE) architecture into the
language model, striking a balance between computational efficiency and model
expressiveness. Second, we propose a progressive knowledge transfer strategy to
ensure comprehensive knowledge migration. This strategy begins with mimic
distillation, where we minimize the Kullback-Leibler (KL) divergence between
output distributions to enable the student model to emulate the teacher
network's understanding. Following this, we introduce preference distillation
via Direct Preference Optimization (DPO), where the key lies in treating l-MLLM
as the reference model. During this phase, the s-MLLM's ability to discriminate
between superior and inferior examples is significantly enhanced beyond l-MLLM,
leading to a better student that surpasses its teacher, particularly in
hallucination benchmarks. Extensive experiments demonstrate that LLaVA-MoD
outperforms existing models across various multimodal benchmarks while
maintaining a minimal number of activated parameters and low computational
costs. Remarkably, LLaVA-MoD, with only 2B activated parameters, surpasses
Qwen-VL-Chat-7B by an average of 8.8% across benchmarks, using merely 0.3% of
the training data and 23% trainable parameters. These results underscore
LLaVA-MoD's ability to effectively distill comprehensive knowledge from its
teacher model, paving the way for the development of more efficient MLLMs. The
code will be available on: https://github.com/shufangxun/LLaVA-MoD.
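A minimal sketch of the mimic-distillation step, assuming a temperature-scaled KL divergence between teacher and student token distributions; the shapes and temperature are assumptions, and the later DPO preference stage is not shown:

```python
# Hedged sketch: KL-based mimic distillation of a teacher's output distribution.
import torch
import torch.nn.functional as F

def mimic_distillation_loss(student_logits, teacher_logits, T=1.0):
    # KL(teacher || student) over the vocabulary; 'batchmean' averages over the batch dim.
    log_p_student = F.log_softmax(student_logits / T, dim=-1)
    p_teacher = F.softmax(teacher_logits / T, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction='batchmean') * (T ** 2)

student_logits = torch.randn(2, 16, 32000, requires_grad=True)  # (batch, tokens, vocab)
teacher_logits = torch.randn(2, 16, 32000)
loss = mimic_distillation_loss(student_logits, teacher_logits)
loss.backward()
```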
♻ ☆ DIP-Watermark: A Double Identity Protection Method Based on Robust Adversarial Watermark
The wide deployment of Face Recognition (FR) systems poses privacy risks. One
countermeasure is adversarial attack, deceiving unauthorized malicious FR, but
it also disrupts regular identity verification of trusted authorizers,
exacerbating the potential threat of identity impersonation. To address this,
we propose the first double identity protection scheme based on traceable
adversarial watermarking, termed DIP-Watermark. DIP-Watermark employs a
one-time watermark embedding to deceive unauthorized FR models and allows
authorizers to perform identity verification by extracting the watermark.
Specifically, we propose an information-guided adversarial attack against FR
models. The encoder embeds an identity-specific watermark into the deep feature
space of the carrier, guiding recognizable features of the image to deviate
from the source identity. We further adopt a collaborative meta-optimization
strategy compatible with sub-tasks, which regularizes the joint optimization
direction of the encoder and decoder. This strategy enhances the representation
of universal carrier features, mitigating multi-objective optimization
conflicts in watermarking. Experiments confirm that DIP-Watermark achieves
significant attack success rates and traceability accuracy on state-of-the-art
FR models, exhibiting remarkable robustness that outperforms the existing
privacy protection methods using adversarial attacks and deep watermarking, or
simple combinations of the two. Our work potentially opens up new insights into
proactive protection for FR privacy.
♻ ☆ Hierarchical Light Transformer Ensembles for Multimodal Trajectory Forecasting
Accurate trajectory forecasting is crucial for the performance of various
systems, such as advanced driver-assistance systems and self-driving vehicles.
These forecasts allow to anticipate events leading to collisions and,
therefore, to mitigate them. Deep Neural Networks have excelled in motion
forecasting, but issues like overconfidence and uncertainty quantification
persist. Deep Ensembles address these concerns, yet applying them to multimodal
distributions remains challenging. In this paper, we propose a novel approach
named Hierarchical Light Transformer Ensembles (HLT-Ens), aimed at efficiently
training an ensemble of Transformer architectures using a novel hierarchical
loss function. HLT-Ens leverages grouped fully connected layers, inspired by
grouped convolution techniques, to capture multimodal distributions effectively. Through extensive experimentation, we demonstrate that HLT-Ens
achieves state-of-the-art performance levels, offering a promising avenue for
improving trajectory forecasting techniques.
comment: acknowledgement added
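A hedged sketch of a grouped fully connected layer in the spirit of grouped convolutions, where each group acts as one ensemble member and all members share a single batched matmul; dimensions and initialization are illustrative assumptions:

```python
# Illustrative sketch: one weight block per ensemble member, applied in a single batched matmul.
import torch
import torch.nn as nn

class GroupedLinear(nn.Module):
    def __init__(self, in_dim, out_dim, groups):
        super().__init__()
        self.weight = nn.Parameter(torch.randn(groups, in_dim, out_dim) * in_dim ** -0.5)
        self.bias = nn.Parameter(torch.zeros(groups, out_dim))

    def forward(self, x):
        # x: (B, groups, in_dim) -> (B, groups, out_dim); each group is one ensemble member.
        return torch.einsum('bgi,gio->bgo', x, self.weight) + self.bias

layer = GroupedLinear(64, 32, groups=4)
out = layer(torch.randn(8, 4, 64))   # four ensemble members evaluated in one forward pass
```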
♻ ☆ Pulling Target to Source: A New Perspective on Domain Adaptive Semantic Segmentation
Domain adaptive semantic segmentation aims to transfer knowledge from a
labeled source domain to an unlabeled target domain. However, existing methods
primarily focus on directly learning qualified target features, making it
challenging to guarantee their discrimination in the absence of target labels.
This work provides a new perspective. We observe that the features learned with
source data manage to keep categorically discriminative during training,
thereby enabling us to implicitly learn adequate target representations by
simply \textbf{pulling target features close to source features for each
category}. To this end, we propose T2S-DA, which we interpret as a form of
pulling Target to Source for Domain Adaptation, encouraging the model in
learning similar cross-domain features. Also, considering the pixel categories
are heavily imbalanced for segmentation datasets, we come up with a dynamic
re-weighting strategy to help the model concentrate on those underperforming
classes. Extensive experiments confirm that T2S-DA learns a more discriminative
and generalizable representation, significantly surpassing the
state-of-the-art. We further show that our method is quite qualified for the
domain generalization task, verifying its domain-invariant property.
comment: Accepted by IJCV
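An illustrative sketch of pulling target features toward per-class source prototypes; the cosine distance, the use of target pseudo-labels, and the feature dimensions are assumptions, and the paper's full objective (including its dynamic re-weighting) is not reproduced:

```python
# Hedged sketch: pull target features toward the source prototype of the same class.
import torch
import torch.nn.functional as F

def pull_target_to_source(src_feat, src_label, tgt_feat, tgt_pseudo, n_classes):
    loss, used = 0.0, 0
    for c in range(n_classes):
        s, t = src_feat[src_label == c], tgt_feat[tgt_pseudo == c]
        if len(s) == 0 or len(t) == 0:
            continue
        proto = F.normalize(s.mean(0), dim=0)                        # source prototype for class c
        loss = loss + (1 - (F.normalize(t, dim=1) @ proto)).mean()   # cosine "pull" term
        used += 1
    return loss / max(used, 1)

src_feat, tgt_feat = torch.randn(100, 128), torch.randn(100, 128)
src_label = torch.randint(0, 19, (100,))
tgt_pseudo = torch.randint(0, 19, (100,))       # pseudo-labels for unlabeled target data
loss = pull_target_to_source(src_feat, src_label, tgt_feat, tgt_pseudo, n_classes=19)
```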
♻ ☆ Improving Text Generation on Images with Synthetic Captions
The recent emergence of latent diffusion models such as SDXL and SD 1.5 has
shown significant capability in generating highly detailed and realistic
images. Despite their remarkable ability to produce images, generating accurate
text within images still remains a challenging task. In this paper, we examine
the validity of fine-tuning approaches in generating legible text within the
image. We propose a low-cost approach by leveraging SDXL without any
time-consuming training on large-scale datasets. The proposed strategy employs
a fine-tuning technique that examines the effects of data refinement levels and
synthetic captions. Moreover, our results demonstrate how our small-scale fine-tuning approach can improve the accuracy of text generation in different scenarios without the need for additional multimodal encoders. Our experiments
show that with the addition of random letters to our raw dataset, our model's
performance improves in producing well-formed visual text.
comment: 2024 16th IIAI International Congress on Advanced Applied Informatics
(IIAI-AAI)
♻ ☆ Harmonizing Visual Text Comprehension and Generation NeurIPS 2024
Zhen Zhao, Jingqun Tang, Binghong Wu, Chunhui Lin, Shu Wei, Hao Liu, Xin Tan, Zhizhong Zhang, Can Huang, Yuan Xie
In this work, we present TextHarmony, a unified and versatile multimodal
generative model proficient in comprehending and generating visual text.
Simultaneously generating images and texts typically results in performance
degradation due to the inherent inconsistency between vision and language
modalities. To overcome this challenge, existing approaches resort to
modality-specific data for supervised fine-tuning, necessitating distinct model
instances. We propose Slide-LoRA, which dynamically aggregates
modality-specific and modality-agnostic LoRA experts, partially decoupling the
multimodal generation space. Slide-LoRA harmonizes the generation of vision and
language within a singular model instance, thereby facilitating a more unified
generative process. Additionally, we develop a high-quality image caption
dataset, DetailedTextCaps-100K, synthesized with a sophisticated closed-source
MLLM to enhance visual text generation capabilities further. Comprehensive
experiments across various benchmarks demonstrate the effectiveness of the
proposed approach. Empowered by Slide-LoRA, TextHarmony achieves comparable
performance to modality-specific fine-tuning results with only a 2% increase in
parameters and shows an average improvement of 2.5% in visual text
comprehension tasks and 4.0% in visual text generation tasks. Our work
delineates the viability of an integrated approach to multimodal generation
within the visual text domain, setting a foundation for subsequent inquiries.
Code is available at https://github.com/bytedance/TextHarmony.
comment: accepted by NeurIPS 2024
♻ ☆ Lightning-Fast Image Inversion and Editing for Text-to-Image Diffusion Models
Dvir Samuel, Barak Meiri, Haggai Maron, Yoad Tewel, Nir Darshan, Shai Avidan, Gal Chechik, Rami Ben-Ari
Diffusion inversion is the problem of taking an image and a text prompt that
describes it and finding a noise latent that would generate the exact same
image. Most current deterministic inversion techniques operate by approximately
solving an implicit equation and may converge slowly or yield poor
reconstructed images. We formulate the problem as finding the roots of an implicit equation and develop a method to solve it efficiently. Our solution is
based on Newton-Raphson (NR), a well-known technique in numerical analysis. We
show that a vanilla application of NR is computationally infeasible while
naively transforming it to a computationally tractable alternative tends to
converge to out-of-distribution solutions, resulting in poor reconstruction and
editing. We therefore derive an efficient guided formulation that converges quickly and provides high-quality reconstructions and editing. We showcase
our method on real image editing with three popular open-sourced diffusion
models: Stable Diffusion, SDXL-Turbo, and Flux with different deterministic
schedulers. Our solution, Guided Newton-Raphson Inversion, inverts an image
within 0.4 sec (on an A100 GPU) for few-step models (SDXL-Turbo and Flux.1),
opening the door for interactive image editing. We further show improved
results in image interpolation and generation of rare objects.
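A model-free toy sketch of the root-finding view of inversion, assuming an element-wise Newton-Raphson update on a scalar problem; the actual method operates on diffusion latents with a guided update, which is not shown here:

```python
# Hedged sketch: element-wise Newton-Raphson root finding on a toy implicit equation.
import torch

def newton_raphson(func, z0, iters=20):
    z = z0.clone()
    for _ in range(iters):
        z = z.detach().requires_grad_(True)
        val = func(z)
        grad, = torch.autograd.grad(val.sum(), z)   # valid d(func)/dz only for element-wise func
        z = z - val / (grad + 1e-8)                 # classic NR update per element
    return z.detach()

# Toy stand-in for the implicit inversion equation F(z) = 0.
target = torch.tensor([0.7])
func = lambda z: torch.tanh(z) - target
z_star = newton_raphson(func, torch.zeros(1))
print(torch.tanh(z_star))   # ~0.7, i.e. the root reproduces the target
```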
♻ ☆ Improving Instance Optimization in Deformable Image Registration with Gradient Projection MICCAI 2024
Deformable image registration is inherently a multi-objective optimization
(MOO) problem, requiring a delicate balance between image similarity and
deformation regularity. These conflicting objectives often lead to poor
optimization outcomes, such as being trapped in unsatisfactory local minima or
experiencing slow convergence. Deep learning methods have recently gained
popularity in this domain due to their efficiency in processing large datasets
and achieving high accuracy. However, they often underperform during test time
compared to traditional optimization techniques, which further explore
iterative, instance-specific gradient-based optimization. This performance gap
is more pronounced when a distribution shift between training and test data
exists. To address this issue, we focus on the instance optimization (IO)
paradigm, which involves additional optimization for test-time instances based
on a pre-trained model. IO effectively combines the generalization capabilities
of deep learning with the fine-tuning advantages of instance-specific
optimization. Within this framework, we emphasize the use of gradient
projection to mitigate conflicting updates in MOO. This technique projects
conflicting gradients into a common space, better aligning the dual objectives
and enhancing optimization stability. We validate our method using a
state-of-the-art foundation model on the 3D Brain inter-subject registration
task (LUMIR) from the Learn2Reg 2024 Challenge. Our results show significant
improvements over standard gradient descent, leading to more accurate and
reliable registration results.
comment: Learn2Reg Challenge at MICCAI 2024
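A minimal sketch of projecting conflicting gradients, assuming the similarity gradient is projected onto the normal plane of the regularity gradient whenever their dot product is negative; the toy losses and step size are illustrative assumptions:

```python
# Hedged sketch: gradient projection between two conflicting objectives (similarity vs. regularity).
import torch

def project_conflicting(g_sim, g_reg):
    flat_s, flat_r = g_sim.flatten(), g_reg.flatten()
    dot = torch.dot(flat_s, flat_r)
    if dot < 0:   # conflict: remove the component of g_sim that opposes g_reg
        flat_s = flat_s - dot / (flat_r.norm() ** 2 + 1e-12) * flat_r
    return flat_s.view_as(g_sim)

# Toy usage with a displacement field as the instance-optimized parameters.
disp = torch.zeros(1, 3, 8, 8, 8, requires_grad=True)
loss_sim = (disp - 0.1).pow(2).mean()     # stands in for an image-similarity loss
loss_reg = disp.pow(2).mean()             # stands in for a smoothness penalty
g_sim, = torch.autograd.grad(loss_sim, disp, retain_graph=True)
g_reg, = torch.autograd.grad(loss_reg, disp)
update = project_conflicting(g_sim, g_reg) + g_reg
disp.data -= 0.1 * update                 # one instance-optimization step
```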
♻ ☆ ERX: A Fast Real-Time Anomaly Detection Algorithm for Hyperspectral Line Scanning
Detecting unexpected objects (anomalies) in real time has great potential for
monitoring, managing, and protecting the environment. Hyperspectral line-scan
cameras are a low-cost solution that enhance confidence in anomaly detection
over RGB and multispectral imagery. However, existing line-scan algorithms are
too slow when using small computers (e.g. those onboard a drone or small
satellite), do not adapt to changing scenery, or lack robustness against
geometric distortions. This paper introduces the Exponentially moving RX
algorithm (ERX) to address these issues, and compares it with existing RX-based
anomaly detection methods for hyperspectral line scanning. Three large and more
complex datasets are also introduced to better assess the practical challenges
when using line-scan cameras (two hyperspectral and one multispectral). ERX is
evaluated using a Jetson Xavier NX compute module, achieving the best
combination of speed and detection performance. This research paves the way for
future studies in grouping and locating anomalous objects, adaptive and
automatic threshold selection, and real-time field tests. The datasets and the
Python code are available at: https://github.com/WiseGamgee/HyperAD.
comment: 17 pages, 13 figures, 4 tables, code and datasets accessible at
https://github.com/WiseGamgee/HyperAD
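A hedged sketch of an exponentially moving RX detector for line-scan data, assuming per-line exponential updates of the background mean and covariance and Mahalanobis scoring; the momentum, regularization, and streaming loop are illustrative assumptions:

```python
# Hedged sketch: exponential-moving background statistics + Mahalanobis (RX) anomaly scores.
import numpy as np

class MovingRX:
    def __init__(self, n_bands, alpha=0.05, eps=1e-3):
        self.alpha, self.eps = alpha, eps
        self.mean = np.zeros(n_bands)
        self.cov = np.eye(n_bands)

    def score_line(self, line):
        """line: (n_pixels, n_bands) -> per-pixel anomaly scores for one scan line."""
        d = line - self.mean
        inv = np.linalg.inv(self.cov + self.eps * np.eye(line.shape[1]))
        scores = np.einsum('ij,jk,ik->i', d, inv, d)          # Mahalanobis distances
        # Exponential moving update of the background statistics with this line.
        self.mean = (1 - self.alpha) * self.mean + self.alpha * line.mean(0)
        self.cov = (1 - self.alpha) * self.cov + self.alpha * np.cov(line, rowvar=False)
        return scores

rx = MovingRX(n_bands=50)
for _ in range(100):                                           # simulated stream of scan lines
    scores = rx.score_line(np.random.rand(640, 50))
```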
♻ ☆ CAT: Contrastive Adapter Training for Personalized Image Generation CVPR
The emergence of various adapters, including Low-Rank Adaptation (LoRA)
applied from the field of natural language processing, has allowed diffusion
models to personalize image generation at a low cost. However, due to the
various challenges including limited datasets and shortage of regularization
and computation resources, adapter training often results in unsatisfactory
outcomes, leading to the corruption of the backbone model's prior knowledge.
One of the well-known phenomena is the loss of diversity in object generation, especially within the same class, which leads to generating almost identical objects with only minor variations. This poses challenges in generation
capabilities. To solve this issue, we present Contrastive Adapter Training
(CAT), a simple yet effective strategy to enhance adapter training through the
application of CAT loss. Our approach facilitates the preservation of the base model's original knowledge when the model initializes adapters. Furthermore, we introduce the Knowledge Preservation Score (KPS) to evaluate CAT's ability to retain the prior information. We qualitatively and quantitatively compare CAT's improvements. Finally, we discuss the potential of CAT for multi-concept adapters and optimization.
comment: CVPRW 2024
♻ ☆ Gaussian-Informed Continuum for Physical Property Identification and Simulation NeurIPS 2024
This paper studies the problem of estimating physical properties (system
identification) through visual observations. To facilitate geometry-aware
guidance in physical property estimation, we introduce a novel hybrid framework
that leverages 3D Gaussian representation to not only capture explicit shapes
but also enable the simulated continuum to render object masks as 2D shape
surrogates during training.
We propose a new dynamic 3D Gaussian framework based on motion factorization
to recover the object as 3D Gaussian point sets across different time states.
Furthermore, we develop a coarse-to-fine filling strategy to generate the
density fields of the object from the Gaussian reconstruction, allowing for the
extraction of object continuums along with their surfaces and the integration
of Gaussian attributes into these continuums.
In addition to the extracted object surfaces, the Gaussian-informed continuum
also enables the rendering of object masks during simulations, serving as
2D-shape guidance for physical property estimation.
Extensive experimental evaluations demonstrate that our pipeline achieves
state-of-the-art performance across multiple benchmarks and metrics.
Additionally, we illustrate the effectiveness of the proposed method through
real-world demonstrations, showcasing its practical utility.
Our project page is at https://jukgei.github.io/project/gic.
comment: 21 pages, 8 figures, NeurIPS 2024
♻ ☆ Toward Fairer Face Recognition Datasets
Face recognition and verification are two computer vision tasks whose
performance has progressed with the introduction of deep representations.
However, ethical, legal, and technical challenges due to the sensitive
character of face data and biases in real training datasets hinder their
development. Generative AI addresses privacy by creating fictitious identities,
but fairness problems persist. We promote fairness by introducing a demographic
attributes balancing mechanism in generated training datasets. We experiment
with an existing real dataset, three generated training datasets, and the
balanced versions of a diffusion-based dataset. We propose a comprehensive
evaluation that considers accuracy and fairness equally and includes a rigorous
regression-based statistical analysis of attributes. The analysis shows that
balancing reduces demographic unfairness. Also, a performance gap persists
despite generation becoming more accurate with time. The proposed balancing
method and comprehensive verification evaluation promote fairer and transparent
face recognition and verification.
♻ ☆ LVBench: An Extreme Long Video Understanding Benchmark
Weihan Wang, Zehai He, Wenyi Hong, Yean Cheng, Xiaohan Zhang, Ji Qi, Xiaotao Gu, Shiyu Huang, Bin Xu, Yuxiao Dong, Ming Ding, Jie Tang
Recent progress in multimodal large language models has markedly enhanced the
understanding of short videos (typically under one minute), and several
evaluation datasets have emerged accordingly. However, these advancements fall
short of meeting the demands of real-world applications such as embodied
intelligence for long-term decision-making, in-depth movie reviews and
discussions, and live sports commentary, all of which require comprehension of
long videos spanning several hours. To address this gap, we introduce LVBench,
a benchmark specifically designed for long video understanding. Our dataset
comprises publicly sourced videos and encompasses a diverse set of tasks aimed
at long video comprehension and information extraction. LVBench is designed to
challenge multimodal models to demonstrate long-term memory and extended
comprehension capabilities. Our extensive evaluations reveal that current
multimodal models still underperform on these demanding long video
understanding tasks. Through LVBench, we aim to spur the development of more
advanced models capable of tackling the complexities of long video
comprehension. Our data and code are publicly available at:
https://lvbench.github.io.
♻ ☆ Advancing Open-Set Domain Generalization Using Evidential Bi-Level Hardest Domain Scheduler NeurIPS 2024
Kunyu Peng, Di Wen, Kailun Yang, Ao Luo, Yufan Chen, Jia Fu, M. Saquib Sarfraz, Alina Roitberg, Rainer Stiefelhagen
In Open-Set Domain Generalization (OSDG), the model is exposed to both new
variations of data appearance (domains) and open-set conditions, where both
known and novel categories are present at test time. The challenges of this
task arise from the dual need to generalize across diverse domains and
accurately quantify category novelty, which is critical for applications in
dynamic environments. Recently, meta-learning techniques have demonstrated
superior results in OSDG, effectively orchestrating the meta-train and -test
tasks by employing varied random categories and predefined domain partition
strategies. These approaches prioritize a well-designed training schedule over
traditional methods that focus primarily on data augmentation and the
enhancement of discriminative feature learning. The prevailing meta-learning
models in OSDG typically utilize a predefined sequential domain scheduler to
structure data partitions. However, a crucial aspect that remains inadequately
explored is the influence brought by strategies of domain schedulers during
training. In this paper, we observe that an adaptive domain scheduler benefits
more in OSDG compared with prefixed sequential and random domain schedulers. We
propose the Evidential Bi-Level Hardest Domain Scheduler (EBiL-HaDS) to achieve
an adaptive domain scheduler. This method strategically sequences domains by
assessing their reliabilities using a follower network, trained with
confidence scores learned in an evidential manner, regularized by max rebiasing
discrepancy, and optimized in a bi-level manner. The results show that our
method substantially improves OSDG performance and achieves more discriminative
embeddings for both the seen and unseen categories. The source code is publicly
available at https://github.com/KPeng9510/EBiL-HaDS.
comment: Accepted to NeurIPS 2024. The source code is publicly available at
https://github.com/KPeng9510/EBiL-HaDS
♻ ☆ Learning to Manipulate Anywhere: A Visual Generalizable Framework For Reinforcement Learning
Can we endow visuomotor robots with generalization capabilities to operate in
diverse open-world scenarios? In this paper, we propose \textbf{Maniwhere}, a
generalizable framework tailored for visual reinforcement learning, enabling
the trained robot policies to generalize across a combination of multiple
visual disturbance types. Specifically, we introduce a multi-view
representation learning approach fused with Spatial Transformer Network (STN)
module to capture shared semantic information and correspondences among
different viewpoints. In addition, we employ a curriculum-based randomization
and augmentation approach to stabilize the RL training process and strengthen
the visual generalization ability. To exhibit the effectiveness of Maniwhere,
we meticulously design 8 tasks encompassing articulate objects, bi-manual, and
dexterous hand manipulation tasks, demonstrating Maniwhere's strong visual
generalization and sim2real transfer abilities across 3 hardware platforms. Our
experiments show that Maniwhere significantly outperforms existing
state-of-the-art methods. Videos are provided at
https://gemcollector.github.io/maniwhere/.
comment: Webpage: https://gemcollector.github.io/maniwhere/
♻ ☆ The Ultimate Combo: Boosting Adversarial Example Transferability by Composing Data Augmentations
To help adversarial examples generalize from surrogate machine-learning (ML)
models to targets, certain transferability-based black-box evasion attacks
incorporate data augmentations (e.g., random resizing). Yet, prior work has
explored limited augmentations and their composition. To fill the gap, we
systematically studied how data augmentation affects transferability.
Specifically, we explored 46 augmentation techniques originally proposed to
help ML models generalize to unseen benign samples, and assessed how they
impact transferability, when applied individually or composed. Performing
exhaustive search on a small subset of augmentation techniques and genetic
search on all techniques, we identified augmentation combinations that help
promote transferability. Extensive experiments with the ImageNet and CIFAR-10
datasets and 18 models showed that simple color-space augmentations (e.g.,
color to greyscale) attain high transferability when combined with standard
augmentations. Furthermore, we discovered that composing augmentations impacts
transferability mostly monotonically (i.e., more augmentations yield greater or equal transferability). We also found that the best composition significantly
outperformed the state of the art (e.g., 91.8% vs. $\le$82.5% average
transferability to adversarially trained targets on ImageNet). Lastly, our
theoretical analysis, backed by empirical evidence, intuitively explains why
certain augmentations promote transferability.
comment: Accepted by AISec'24
♻ ☆ Diffusion Models are Certifiably Robust Classifiers NeurIPS 2024
Generative learning, recognized for its effective modeling of data
distributions, offers inherent advantages in handling out-of-distribution
instances, especially for enhancing robustness to adversarial attacks. Among
these, diffusion classifiers, utilizing powerful diffusion models, have
demonstrated superior empirical robustness. However, a comprehensive
theoretical understanding of their robustness is still lacking, raising
concerns about their vulnerability to stronger future attacks. In this study,
we prove that diffusion classifiers possess $O(1)$ Lipschitzness, and establish
their certified robustness, demonstrating their inherent resilience. To achieve
non-constant Lipschitzness, thereby obtaining much tighter certified
robustness, we generalize diffusion classifiers to classify Gaussian-corrupted
data. This involves deriving the evidence lower bounds (ELBOs) for these
distributions, approximating the likelihood using the ELBO, and calculating
classification probabilities via Bayes' theorem. Experimental results show the
superior certified robustness of these Noised Diffusion Classifiers (NDCs).
Notably, we achieve over 80% and 70% certified robustness on CIFAR-10 under
adversarial perturbations with \(\ell_2\) norms less than 0.25 and 0.5,
respectively, using a single off-the-shelf diffusion model without any
additional data.
comment: Accepted by NeurIPS 2024
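An illustrative sketch of diffusion-classifier inference via Bayes' rule, assuming the class-conditional log-likelihood is approximated by a (negative) per-class diffusion loss and converted to class probabilities under a uniform prior; `elbo_loss` is a hypothetical stand-in for a conditional diffusion model's ELBO estimate:

```python
# Hedged sketch: approximate log p(x | y) with a per-class ELBO, then apply Bayes' rule.
import torch

def classify_with_diffusion(x, elbo_loss, n_classes, n_noise_samples=8):
    log_lik = []
    for y in range(n_classes):
        losses = torch.stack([elbo_loss(x, y) for _ in range(n_noise_samples)])
        log_lik.append(-losses.mean())                        # ELBO approximates log p(x | y)
    return torch.softmax(torch.stack(log_lik), dim=0)         # p(y | x) under a uniform prior

# Toy stand-in: a "model" whose loss is smallest for class 2 on this input.
elbo_loss = lambda x, y: (x.mean() - y).pow(2) + 0.01 * torch.randn(())
probs = classify_with_diffusion(torch.full((3, 32, 32), 2.0), elbo_loss, n_classes=10)
print(probs.argmax())   # tensor(2)
```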
♻ ☆ SemiSAM: Enhancing Semi-Supervised Medical Image Segmentation via SAM-Assisted Consistency Regularization
Semi-supervised learning has attracted much attention due to its less
dependence on acquiring abundant annotations from experts compared to fully
supervised methods, which is especially important for medical image
segmentation which typically requires intensive pixel/voxel-wise labeling by
domain experts. Although semi-supervised methods can improve the performance by
utilizing unlabeled data, there are still gaps compared to fully supervised methods
under extremely limited annotation scenarios. In this paper, we propose a
simple yet efficient strategy to explore the usage of the Segment Anything
Model (SAM) for enhancing semi-supervised medical image segmentation.
Concretely, the segmentation model trained with domain knowledge provides localization information and generates input prompts for SAM. Then the
generated pseudo-labels of SAM are utilized as additional supervision to assist
in the learning procedure of the semi-supervised framework. Extensive
experiments demonstrate that SemiSAM significantly improves the performance of
existing semi-supervised frameworks when only one or a few labeled images are
available and shows strong efficiency as a plug-and-play strategy for
semi-supervised medical image segmentation.
comment: Accept for BIBM 2024
♻ ☆ RotCAtt-TransUNet++: Novel Deep Neural Network for Sophisticated Cardiac Segmentation
Cardiovascular disease remains a predominant global health concern,
responsible for a significant portion of mortality worldwide. Accurate
segmentation of cardiac medical imaging data is pivotal in mitigating fatality
rates associated with cardiovascular conditions. However, existing
state-of-the-art (SOTA) neural networks, including both CNN-based and
Transformer-based approaches, exhibit limitations in practical applicability
due to their inability to effectively capture inter-slice connections alongside
intra-slice information. This deficiency is particularly evident in datasets
featuring intricate, long-range details along the z-axis, such as coronary
arteries in axial views. Additionally, SOTA methods fail to differentiate
non-cardiac components from myocardium in segmentation, leading to the
"spraying" phenomenon. To address these challenges, we present
RotCAtt-TransUNet++, a novel architecture tailored for robust segmentation of
complex cardiac structures. Our approach emphasizes modeling global contexts by
aggregating multiscale features with nested skip connections in the encoder. It
integrates transformer layers to capture interactions between patches and
employs a rotatory attention mechanism to capture connectivity between multiple
slices (inter-slice information). Additionally, a channel-wise cross-attention
gate guides the fused multi-scale channel-wise information and features from
decoder stages to bridge semantic gaps. Experimental results demonstrate that
our proposed model outperforms existing SOTA approaches across four cardiac
datasets and one abdominal dataset. Importantly, coronary arteries and
myocardium are annotated with near-perfect accuracy during inference. An
ablation study shows that the rotatory attention mechanism effectively
transforms embedded vectorized patches in the semantic dimensional space,
enhancing segmentation accuracy.
comment: 11 pages, 11 figures
♻ ☆ Exploring Self-Supervised Skeleton-Based Human Action Recognition under Occlusions
Yifei Chen, Kunyu Peng, Alina Roitberg, David Schneider, Jiaming Zhang, Junwei Zheng, Ruiping Liu, Yufan Chen, Kailun Yang, Rainer Stiefelhagen
To integrate self-supervised skeleton-based action recognition methods into
autonomous robotic systems, it is crucial to consider adverse situations
involving target occlusions. Such a scenario, despite its practical relevance,
is rarely addressed in existing self-supervised skeleton-based action
recognition methods. To empower models with the capacity to address occlusion,
we propose a simple and effective method. We first pre-train using occluded
skeleton sequences, then use k-means clustering (KMeans) on sequence embeddings
to group semantically similar samples. Next, we propose KNN-Imputation to fill
in missing skeleton data based on the closest sample neighbors. Imputing
incomplete skeleton sequences to create relatively complete sequences as input
provides significant benefits to existing skeleton-based self-supervised
methods. Meanwhile, building on the state-of-the-art Partial Spatio-Temporal
Learning (PSTL), we introduce an Occluded Partial Spatio-Temporal Learning
(OPSTL) framework. This enhancement utilizes Adaptive Spatial Masking (ASM) for
better use of high-quality, intact skeletons. The new proposed method is
verified on the challenging occluded versions of the NTURGB+D 60 and NTURGB+D
120. The source code is publicly available at https://github.com/cyfml/OPSTL.
comment: The source code is publicly available at
https://github.com/cyfml/OPSTL
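A hedged sketch of KNN-based imputation of occluded skeleton data, assuming missing joints are filled from the average of the nearest-neighbour sequences in an embedding space; the distance, k, and zero-masking convention are illustrative assumptions:

```python
# Hedged sketch: fill occluded (masked) skeleton entries from nearest-neighbour sequences.
import torch

def knn_impute(sequences, embeddings, masks, k=3):
    """sequences: (N, T, J, C); embeddings: (N, D); masks: (N, T, J) with 1 = observed."""
    filled = sequences.clone()
    dist = torch.cdist(embeddings, embeddings)
    dist.fill_diagonal_(float('inf'))                 # a sequence is not its own neighbour
    knn = dist.topk(k, largest=False).indices         # (N, k) nearest neighbours
    for i in range(sequences.shape[0]):
        neighbour_avg = sequences[knn[i]].mean(0)     # (T, J, C) average of k neighbours
        missing = masks[i] == 0
        filled[i][missing] = neighbour_avg[missing]   # impute only the occluded entries
    return filled

N, T, J, C = 16, 30, 25, 3
seqs = torch.randn(N, T, J, C)
masks = (torch.rand(N, T, J) > 0.2).float()
seqs = seqs * masks.unsqueeze(-1)                     # simulate occlusion by zeroing entries
imputed = knn_impute(seqs, embeddings=torch.randn(N, 128), masks=masks)
```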
♻ ☆ RealignDiff: Boosting Text-to-Image Diffusion Model with Coarse-to-fine Semantic Re-alignment
Guian Fang, Zutao Jiang, Jianhua Han, Guansong Lu, Hang Xu, Shengcai Liao, Xiaojun Chang, Xiaodan Liang
Recent advances in text-to-image diffusion models have achieved remarkable
success in generating high-quality, realistic images from textual descriptions.
However, these approaches have faced challenges in precisely aligning the
generated visual content with the textual concepts described in the prompts. In
this paper, we propose a two-stage coarse-to-fine semantic re-alignment method,
named RealignDiff, aimed at improving the alignment between text and images in
text-to-image diffusion models. In the coarse semantic re-alignment phase, a
novel caption reward, leveraging the BLIP-2 model, is proposed to evaluate the
semantic discrepancy between the generated image caption and the given text
prompt. Subsequently, the fine semantic re-alignment stage employs a local
dense caption generation module and a re-weighting attention modulation module
to refine the previously generated images from a local semantic view.
Experimental results on the MS-COCO and ViLG-300 datasets demonstrate that the
proposed two-stage coarse-to-fine semantic re-alignment method outperforms
other baseline re-alignment techniques by a substantial margin in both visual
quality and semantic similarity with the input prompt.
♻ ☆ Real-World Robot Applications of Foundation Models: A Review
Recent developments in foundation models, like Large Language Models (LLMs)
and Vision-Language Models (VLMs), trained on extensive data, facilitate
flexible application across different tasks and modalities. Their impact spans
various fields, including healthcare, education, and robotics. This paper
provides an overview of the practical application of foundation models in
real-world robotics, with a primary emphasis on the replacement of specific
components within existing robot systems. The summary encompasses the
perspective of input-output relationships in foundation models, as well as
their role in perception, motion planning, and control within the field of
robotics. This paper concludes with a discussion of future challenges and
implications for practical robot applications.
♻ ☆ MMBench-Video: A Long-Form Multi-Shot Benchmark for Holistic Video Understanding NeurIPS 2024
The advent of large vision-language models (LVLMs) has spurred research into
their applications in multi-modal contexts, particularly in video
understanding. Traditional VideoQA benchmarks, despite providing quantitative
metrics, often fail to encompass the full spectrum of video content and
inadequately assess models' temporal comprehension. To address these
limitations, we introduce MMBench-Video, a quantitative benchmark designed to
rigorously evaluate LVLMs' proficiency in video understanding. MMBench-Video
incorporates lengthy videos from YouTube and employs free-form questions,
mirroring practical use cases. The benchmark is meticulously crafted to probe
the models' temporal reasoning skills, with all questions human-annotated
according to a carefully constructed ability taxonomy. We employ GPT-4 for
automated assessment, demonstrating superior accuracy and robustness over
earlier LLM-based evaluations. Utilizing MMBench-Video, we have conducted
comprehensive evaluations that include both proprietary and open-source LVLMs
for images and videos. MMBench-Video stands as a valuable resource for the
research community, facilitating improved evaluation of LVLMs and catalyzing
progress in the field of video understanding. The evaluation code of
MMBench-Video will be integrated into VLMEvalKit:
https://github.com/open-compass/VLMEvalKit.
comment: Accepted in NeurIPS 2024 Datasets and Benchmarks Track
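The automated assessment mentioned above follows the familiar LLM-as-judge pattern: GPT-4 compares a model's free-form answer against the human-annotated reference and returns a score. The sketch below illustrates this setup; the rubric, scale, and prompt wording are assumptions rather than the benchmark's actual grading template.

```python
# Hedged sketch of LLM-as-judge grading for free-form VideoQA answers.
# The rubric, 0-3 scale, and prompt wording are illustrative assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

def judge_answer(question: str, reference: str, prediction: str) -> int:
    prompt = (
        "You are grading a video question-answering model.\n"
        f"Question: {question}\n"
        f"Reference answer: {reference}\n"
        f"Model answer: {prediction}\n"
        "Rate the model answer from 0 (wrong) to 3 (fully correct). "
        "Reply with the number only."
    )
    resp = client.chat.completions.create(
        model="gpt-4",
        messages=[{"role": "user", "content": prompt}],
        temperature=0,
    )
    return int(resp.choices[0].message.content.strip())
```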
♻ ☆ Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding EMNLP 2024
Visual arguments, often used in advertising or social causes, rely on images
to persuade viewers to do or believe something. Understanding these arguments
requires selective vision: only specific visual stimuli within an image are
relevant to the argument, and relevance can only be understood within the
context of a broader argumentative structure. While visual arguments are
readily appreciated by human audiences, we ask: are today's AI systems capable of
similar understanding? We present VisArgs, a dataset of 1,611 images annotated
with 5,112 visual premises (with regions), 5,574 commonsense premises, and
reasoning trees connecting them into structured arguments. We propose three
tasks for evaluating visual argument understanding: premise localization,
premise identification, and conclusion deduction. Experiments show that 1)
machines struggle to capture visual cues: GPT-4o achieved 78.5% accuracy,
while humans reached 98.0%. Models also performed 19.5% worse when
distinguishing between irrelevant objects within the image compared to external
objects. 2) Providing relevant visual premises improved model performance
significantly.
comment: 12 pages, 6 figures. Accepted as main paper in EMNLP 2024
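To make the annotation structure concrete, one possible (unofficial) record layout for a VisArgs example is sketched below, with visual premises grounded to image regions and a reasoning tree linking premises to the conclusion; all field names and example values are illustrative.

```python
# Illustrative (unofficial) schema for a single VisArgs example: visual
# premises are tied to image regions, commonsense premises are free text,
# and a reasoning tree links premises to the conclusion.
from dataclasses import dataclass

@dataclass
class VisualPremise:
    text: str
    box: tuple          # (x1, y1, x2, y2) region in the image

@dataclass
class VisArgsExample:
    image_path: str
    visual_premises: list       # list[VisualPremise]
    commonsense_premises: list  # list[str]
    reasoning_tree: list        # premise-to-conclusion links, e.g. ("VP1", "CP1", "C")
    conclusion: str

example = VisArgsExample(
    image_path="ad_001.jpg",
    visual_premises=[VisualPremise("a cigarette drawn as a gun barrel", (120, 40, 310, 200))],
    commonsense_premises=["guns can kill people"],
    reasoning_tree=[("VP1", "CP1", "C")],
    conclusion="smoking kills",
)
```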
♻ ☆ Can visual language models resolve textual ambiguity with visual cues? Let visual puns tell you! EMNLP 2024
Humans possess multimodal literacy, allowing them to actively integrate
information from various modalities to form reasoning. Faced with challenges
like lexical ambiguity in text, we supplement this with other modalities, such
as thumbnail images or textbook illustrations. Is it possible for machines to
achieve a similar multimodal understanding capability? In response, we present
Understanding Pun with Image Explanations (UNPIE), a novel benchmark designed
to assess the impact of multimodal inputs in resolving lexical ambiguities.
Puns serve as the ideal subject for this evaluation due to their intrinsic
ambiguity. Our dataset includes 1,000 puns, each accompanied by an image that
explains both meanings. We pose three multimodal challenges with the
annotations to assess different aspects of multimodal literacy: Pun Grounding,
Disambiguation, and Reconstruction. The results indicate that various Socratic
Models and Visual-Language Models improve over the text-only models when given
visual context, particularly as the complexity of the tasks increases.
comment: Accepted as main paper in EMNLP 2024
♻ ☆ CV-VAE: A Compatible Video VAE for Latent Generative Video Models
Spatio-temporal compression of videos, utilizing networks such as Variational
Autoencoders (VAEs), plays a crucial role in OpenAI's Sora and numerous other
video generative models. For instance, many LLM-like video models learn the
distribution of discrete tokens derived from 3D VAEs within the VQVAE
framework, while most diffusion-based video models capture the distribution of
continuous latents extracted by 2D VAEs without quantization. Temporal
compression is typically realized by uniform frame sampling, which results in
non-smooth motion between consecutive frames. The research community currently
lacks a commonly used continuous video (3D) VAE for latent diffusion-based
video models. Moreover, since current diffusion-based approaches are often
implemented using pre-trained text-to-image (T2I) models, directly training a
video VAE without considering compatibility with existing T2I models creates a
latent space gap between them that is computationally expensive to bridge,
even when the T2I models are used as initialization. To address this issue, we
propose CV-VAE, a method for training a video VAE for latent video models whose latent space is
compatible with that of a given image VAE, e.g., image VAE of Stable Diffusion
(SD). The compatibility is achieved by the proposed novel latent space
regularization, which involves formulating a regularization loss using the
image VAE. Benefiting from the latent space compatibility, video models can be
trained seamlessly from pre-trained T2I or video models in a truly
spatio-temporally compressed latent space, rather than simply sampling video
frames at equal intervals. With our CV-VAE, existing video models can generate
four times more frames with minimal finetuning. Extensive experiments are
conducted to demonstrate the effectiveness of the proposed video VAE.
comment: Project Page: https://ailab-cvc.github.io/cvvae/index.html
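The latent-space regularization described above can be thought of as keeping the video VAE's latents close to what the frozen image VAE would produce for the corresponding frames. A minimal sketch is given below, assuming an MSE penalty on the first frame and a hypothetical `video_vae.encode` interface; the abstract does not specify the exact form of the loss.

```python
# Hedged sketch of a latent-space regularization loss that ties a video VAE's
# latents to a frozen image VAE (e.g., the SD image VAE). The MSE penalty,
# first-frame alignment, and `video_vae` interface are assumptions.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL

image_vae = AutoencoderKL.from_pretrained(
    "stabilityai/sd-vae-ft-mse"
).eval().requires_grad_(False)

def latent_alignment_loss(video_vae, video):
    # video: (B, C, T, H, W). Assumption: the (hypothetical) video_vae.encode
    # returns latents of shape (B, 4, T', H/8, W/8) whose first temporal slice
    # corresponds to the first input frame, as in causal 3D VAEs.
    video_latent = video_vae.encode(video)
    with torch.no_grad():
        ref = image_vae.encode(video[:, :, 0]).latent_dist.mode()  # (B, 4, H/8, W/8)
    return F.mse_loss(video_latent[:, :, 0], ref)
```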
♻ ☆ GPHM: Gaussian Parametric Head Model for Monocular Head Avatar Reconstruction
Creating high-fidelity 3D human head avatars is crucial for applications in
VR/AR, digital human, and film production. Recent advances have leveraged
morphable face models to generate animated head avatars from easily accessible
data, representing varying identities and expressions within a low-dimensional
parametric space. However, existing methods often struggle with modeling
complex appearance details, e.g., hairstyles, and suffer from low rendering
quality and efficiency. In this paper we introduce a novel approach, 3D
Gaussian Parametric Head Model, which employs 3D Gaussians to accurately
represent the complexities of the human head, allowing precise control over
both identity and expression. The Gaussian model can handle intricate details,
enabling realistic representations of varying appearances and complex
expressions. Furthermore, we present a well-designed training framework that
ensures smooth convergence and robust learning of this rich content. Our
method achieves high-quality, photo-realistic rendering with
real-time efficiency, making it a valuable contribution to the field of
parametric head models. Finally, we apply the 3D Gaussian Parametric Head Model
to monocular video or few-shot head avatar reconstruction tasks, which enables
instant reconstruction of high-quality 3D head avatars even when input data is
extremely limited, surpassing previous methods in terms of reconstruction
quality and training speed.
comment: Project page: https://yuelangx.github.io/gphmv2/
♻ ☆ CoIN: A Benchmark of Continual Instruction tuNing for Multimodel Large Language Model
Instruction tuning represents a prevalent strategy employed by Multimodal
Large Language Models (MLLMs) to align with human instructions and adapt to new
tasks. Nevertheless, MLLMs encounter the challenge of adapting to users'
evolving knowledge and demands. Therefore, how to retain existing skills while
acquiring new knowledge needs to be investigated. In this paper, we present a
comprehensive benchmark, namely Continual Instruction tuNing (CoIN), to assess
existing MLLMs in the sequential instruction tuning paradigm. CoIN comprises 10
commonly used datasets spanning 8 task categories, ensuring a diverse range of
instructions and tasks. In addition, the trained model is evaluated from two
aspects: Instruction Following and General Knowledge, which assess the
alignment with human intention and knowledge preserved for reasoning,
respectively. Experiments on CoIN demonstrate that current powerful MLLMs
still suffer from catastrophic forgetting, and that the failure is driven
mainly by degraded intention alignment rather than by loss of knowledge. To
this end, we introduce MoELoRA into MLLMs, which is effective in retaining the
previous instruction alignment. Experimental results on CoIN consistently show
that this method reduces forgetting.
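MoELoRA pairs a mixture-of-experts router with several LoRA adapters so that different instructions can route to different low-rank experts. The abstract gives no implementation details, so the layer below is only a generic sketch of that idea: a frozen base linear layer plus a softly gated sum of parallel LoRA updates.

```python
# Generic sketch of a mixture-of-LoRA-experts linear layer (not the paper's
# exact design): a frozen base projection plus a soft router over several
# low-rank adapters.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoELoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, num_experts=4, rank=8, alpha=16.0):
        super().__init__()
        self.base = base.requires_grad_(False)          # frozen pretrained weight
        d_in, d_out = base.in_features, base.out_features
        self.router = nn.Linear(d_in, num_experts)
        self.lora_a = nn.Parameter(torch.randn(num_experts, d_in, rank) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(num_experts, rank, d_out))
        self.scale = alpha / rank

    def forward(self, x):                               # x: (..., d_in)
        gates = F.softmax(self.router(x), dim=-1)       # (..., num_experts)
        # Each expert is a rank-r update B_e A_e; mix the experts with the gates.
        expert_out = torch.einsum("...d,edr,ero->...eo", x, self.lora_a, self.lora_b)
        lora_out = (gates.unsqueeze(-1) * expert_out).sum(dim=-2)
        return self.base(x) + self.scale * lora_out
```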
♻ ☆ CD-NGP: A Fast Scalable Continual Representation for Dynamic Scenes
Current methodologies for novel view synthesis (NVS) in dynamic scenes
encounter significant challenges in harmonizing memory consumption, model
complexity, training efficiency, and rendering fidelity. Existing offline
techniques, while delivering high-quality results, are often characterized by
substantial memory demands and limited scalability. In contrast, online methods
grapple with the challenge of balancing rapid convergence with model
compactness. To address these issues, we propose continual dynamic neural
graphics primitives (CD-NGP). Our approach synergizes features from both
temporal and spatial hash encodings to achieve high rendering quality, employs
parameter reuse to enhance scalability, and leverages a continual learning
framework to mitigate memory overhead. Furthermore, we introduce a novel
dataset comprising multi-view, exceptionally long video sequences with
substantial rigid and non-rigid motion, thereby substantiating the scalability
of our method.
comment: new template, editing
♻ ☆ Hybrid Spatial Representations for Species Distribution Modeling SDM
We address an important problem in ecology called Species Distribution
Modeling (SDM), whose goal is to predict whether a species exists at a certain
position on Earth. In particular, we tackle a challenging version of this task,
where we learn from presence-only data in a community-sourced dataset, model a
large number of species simultaneously, and do not use any additional
environmental information. Previous work has used neural implicit
representations to construct models that achieve promising results. However,
implicit representations often generate predictions of limited spatial
precision. We attribute this limitation to their inherently global formulation
and inability to effectively capture local feature variations. This issue is
especially pronounced with presence-only data and a large number of species. To
address this, we propose a hybrid embedding scheme that combines both implicit
and explicit embeddings. Specifically, the explicit embedding is implemented
with a multiresolution hashgrid, enabling our models to better capture local
information. Experiments demonstrate that our method outperforms prior works
by a large margin on various standard benchmarks, and that the hybrid
representation is better than purely implicit or purely explicit ones.
Qualitative
visualizations and comprehensive ablation studies reveal that our hybrid
representation successfully addresses the two main challenges. Our code is
open-sourced at https://github.com/Shiran-Yuan/HSR-SDM.
comment: Project codebase https://github.com/Shiran-Yuan/HSR-SDM
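To make the hybrid embedding concrete, the sketch below concatenates a sinusoidal (implicit-style) encoding of a normalized location with features from a small multiresolution hash grid and feeds the result to an MLP producing per-species presence logits. Grid resolutions, feature sizes, and the nearest-cell lookup are illustrative simplifications; the actual model is in the released codebase.

```python
# Hedged sketch of a hybrid location embedding for SDM: sinusoidal features
# (global/implicit) concatenated with multiresolution hash-grid features
# (local/explicit), followed by an MLP that predicts per-species logits.
# All sizes and the nearest-cell lookup are illustrative assumptions.
import torch
import torch.nn as nn

class HashGrid2D(nn.Module):
    def __init__(self, levels=(16, 64, 256), feat_dim=2, table_size=2**16):
        super().__init__()
        self.levels = levels
        self.tables = nn.ParameterList(
            [nn.Parameter(torch.randn(table_size, feat_dim) * 1e-3) for _ in levels]
        )
        self.primes = torch.tensor([1, 2654435761])

    def forward(self, xy):                      # xy in [0, 1]^2, shape (B, 2)
        feats = []
        for res, table in zip(self.levels, self.tables):
            cell = (xy * res).long()            # nearest cell only; real hash grids interpolate
            h = (cell * self.primes.to(cell.device)).sum(-1) % table.shape[0]
            feats.append(table[h])
        return torch.cat(feats, dim=-1)         # (B, len(levels) * feat_dim)

class HybridSDM(nn.Module):
    def __init__(self, num_species, num_freqs=8):
        super().__init__()
        self.freqs = 2.0 ** torch.arange(num_freqs).float()
        self.grid = HashGrid2D()
        in_dim = 2 * 2 * num_freqs + 3 * 2      # sinusoidal + hash-grid features
        self.mlp = nn.Sequential(
            nn.Linear(in_dim, 256), nn.ReLU(),
            nn.Linear(256, num_species),        # per-species presence logits
        )

    def forward(self, lonlat01):                # (B, 2), normalized to [0, 1]
        ang = lonlat01.unsqueeze(-1) * self.freqs.to(lonlat01.device)   # (B, 2, F)
        implicit = torch.cat([ang.sin(), ang.cos()], dim=-1).flatten(1) # (B, 4F)
        explicit = self.grid(lonlat01)                                  # (B, 6)
        return self.mlp(torch.cat([implicit, explicit], dim=-1))
```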